Example Architectural Decision – Network I/O Control Shares/Limits for ESXi Host using IP Storage

Problem Statement

With 10GB connections becoming the norm, ESXi hosts will generally have less physical connections than in the past where 1Gb was generally used, but more bandwidth per connection (and in total) than a host with 1GB NICs.

In this case, the hosts have only to 2 x 10GB NICs and the design needs to cater for all traffic (including IP storage) for the ESXi hosts.

The design needs to ensure all types of traffic have sufficient burst and sustained bandwidth for all traffic types without significantly negatively impacting other types of traffic.

How can this be achieved?

Assumptions

1. No additional Network cards (1gb or 10gb) can be supported
2. vSphere 5.1
3. Multi-NIC vMotion is desired

Constraints

1. Two (2) x 10GB NICs

Motivation

1. Ensure IP Storage (NFS) performance is optimal
2.Ensure vMotion activities (including a host entering maintenance mode) can be performed in a timely manner without impact to IP Storage or Fault Tolerance
3. Fault tolerance is a latency-sensitive traffic flow, so it is recommended to always set the corresponding resource-pool shares to a reasonably high relative value in the case of custom shares.
4. Proactively address potential contention due to limited physical network interfaces

Architectural Decision

Use one dvSwitch to support all VMKernel and virtual machine network traffic.

Enable Network I/O control, and configure NFS and/or iSCSI traffic with a share value of 100 and ESXi Management , vMotion & FT which will have share value of 25. Virtual Machine traffic will have a share value of 50.

Configure the two (2) VMKernel’s for IP Storage on dvSwitch and set to be Active on one 10GB interface and Standby on the second.

Configure two VMKernel interfaces for vMotion on the dvSwitch and set the first as Active on one interface and standby on the second.

A single VMKernel will be configured for Fault tolerance and will be configured as Active on one interface and standby on the second.

For ESXi Management, the VMKernel will be configured as Active on the interface where FT is standby and standby on the second interface.

All dvPortGroups for Virtual machine traffic will be active on both interfaces.

Justification

1. The share values were chosen to ensure IP storage traffic is not impacted as this can cause flow on effects for the environments performance. vMotion & FT are considered important, but during periods of contention, should not monopolize or impact IP storage traffic.
2. IP Storage is more critical to ongoing cluster and VM performance than ESXi Management, vMotion or FT
3. IP storage requires higher priority than vMotion which is more of a burst activity and is not as critical to VM performance
4. With a share value of 25,  Fault Tolerance still has ample bandwidth to support the maximum supported FT machines per host of 4 even during periods of contention
5. With a share value of 25, vMotion still has ample bandwidth to support multiple concurrent vMotion’s during contention however performance should not be impacted on a day to day basis. With up to 8 vMotion’s supported as it is configured on a 10GB interface. (Limit of 4 on a 1GB interface) Where no contention exists, vMotion traffic can burst and use a large percentage of both 10GB interfaces to complete vMotion activity as fast as possible
6. With a share value of 25,  ESXi Management still has ample bandwidth to continue normal operations even during periods of contention
7. When using bandwidth allocation, use “shares” instead of “limits,” as the former has greater flexibility for unused capacity redistribution.
8. With a share value of 50,  Virtual machine traffic still has ample bandwidth and should result in minimal or no impact to VM performance across 10Gb NICs
9. Setting Limits may prevent operations from completing in a timely manner where there is no contention

Implications

1. In the unlikely event of significant and ongoing contention, performance for vMotion may affect the ability to perform the evacuation of a host in a timely manner. This may extend scheduled maintenance windows.
2. VMs protected by FT may be impacted

Alternatives

1. Use a share value  of 50 for IP storage traffic to more evenly share bandwidth during periods of contention. However this may impact VM performance eg: Increased CPU WAIT if the IP storage is not keeping up with the storage demand

Related Posts
1. Example VMware vNetworking Design for IP Storage (4 x 10GB NICs)
2. Example VMware vNetworking Design for IP Storage (2 x 100GB NICs)
3. Frank Denneman (VCDX) – Designing your vMotion Network – Multi-NIC vMotion & NIOC

Example Architectural Decision – Securing vMotion & Fault Tolerance Traffic in IaaS/Cloud Environments

Problem Statement

vMotion and Fault tolerance logging traffic is unencrypted and anyone with access to the same VLAN/network could potentially view and/or compromise this traffic. How can the environment be made as secure as possible to ensure security between in a multi-tenant/multi-department environment?

Assumptions

1.  vMotion and FT is required in the vSphere cluster/s (although FT is currently not supported for VMs hosted with vCloud Director)
2. IP Storage is being used and vNetworking has 2 x 10GB for non Virtual Machine traffic such as VMKernel’s & 2 x 10GB NICs are available for Virtual Machine traffic (Similar to Example vNetworking Design for IP Storage)
3. VI3 or later

Motivation

1. Ensure maximum security and performance for vMotion and FT traffic
2. Prevent vMotion and/or FT traffic impacting production virtual machines

Architectural Decision

vMotion & Fault tolerance logging traffic will each have a dedicated non routable VLAN which will be hosted on a dvSwitch which is physically separate from virtual machine distributed virtual switch.

Justification

1.  vMotion / FT traffic does not require external (or public) access
2. A VLAN per function ensures maximum security / performance with minimal design / implementation overhead
3. Prevent vMotion and/or FT traffic potentially impacting production virtual machine and vice versa by having the traffic share one or more broadcast domain/s
4. Ensure vMotion/FT traffic cannot leave there respective dedicated VLAN/s and potentially be sniffed

Implications

1. Two (2) VLANs with private IP ranges are required to be presented over 802.1q connections to the appropriate pNICs

Alternatives

1.  vMotion / FT share the ESXi management VLAN – This would increase risk of traffic being intercepted and “sniffed”
2. vMotion / FT share a dvSwitch with Virtual Machine networks while still running within dedicated non routable VLANs over 802.1q

Example Architectural Decision – Enhanced vMotion Compatiblity

Problem Statement

The virtual infrastructure is required to scale over time as demand for compute and/or availability increases.
When purchasing additional ESXi hosts over an expected ESXi host hardware life of >=3 year it is unlikely that the exact make/model of server or CPU type will be available. The solution needs to ensure full functionality across ESXi hosts (specifically vMotion) which may not be exactly the same hardware, although all processors will always be from the same vendor.

How can the vSphere cluster/s be configured for maximum flexibility without significant impact to Virtual machine performance?

Assumptions

1. All CPU types will be Intel or AMD but not a mix of the two
2. All CPUs will have a supported EVC mode

Motivation

1. Ensure full functionality between ESXi hosts whos Intel CPUs may not match exactly
2. Prevent having to purchase large volumes of identical hardware at one time
3. Allow vSphere clusters to be expanded over time using similar, but not identical hardware although maintaining the same CPU make.

Architectural Decision

Enable EVC and maintain it at the maximum supported EVC level for all ESXi hosts in each vSphere cluster.

Justification

1. vMotion is a requirement for the cluster/s to ensure maximum flexibility
2. It is essential to avoid downtime where possible. EVC ensures VMs can be vMotion’d to newer hosts for the purpose of expanding a cluster, OR alternatively, to newer hardware so older hardware can be decommissioned without impact to the VM.
3. The EVC level for the cluster can be increased without downtime
4. Having EVC disabled would require virtual machines being migrated to new hardware have downtime where CPU types are not similar
5. If EVC was not enabled, newer hardware may be placed into a new (smaller) cluster/s and this would add an unnecessary HA overhead as well as reduce the efficiency of DRS

Implications

1. Where the EVC level for a cluster is increased, virtual machines will not leverage new CPU features unmasked by EVC until the next reboot
2. In the event new hardware is added to a cluster and the new hardware is compatible with a higher EVC mode, a virtual machine which has a workload which can benefit from CPU features masked by the existing EVC mode may not perform at the optimal level until older hardware is removed from the cluster and the EVC mode increased.

Alternatives

1. Leave EVC disabled and where CPU types are not compatible to vMotion, shutdown the guest OS for migrations.