Example Architectural Decision – Memory Reservation for Virtual Desktops

Problem Statement

In a VMware View (VDI) environment with a large number of virtual desktops, the potential Tier 1 storage requirement for vswap files (*.vswp) can make the solution less attractive from an ROI perspective and result in a high upfront cost for storage. What can be done to minimize the storage requirement for the vswap files, thus reducing the storage requirements for the VMware View (VDI) solution?

Assumptions

1. vSwap files are placed on Tier 1 shared storage with the virtual machine (default setting)

Motivation

1. Minimize the storage requirements for the virtual desktop solution
2. Reduce the upfront cost of storage for VDI
3. Ensure the VDI solution gets the fastest ROI possible without compromising performance

Architectural Decision

Configure the VMware View Master Template with a 50% memory reservation so that all VDI machines deployed from it inherit the 50% reservation.

Justification

1. Setting 50% reservation reduces the storage requirement for vSwap by half
2. Setting only 50% ensures some memory overcommitment and transparent page sharing can still be achieved
3. Memory overcommitment is generally much lower than CPU overcommitment (around 1.5:1 for VDI)
4. Reserving 50% of a VDI machine's RAM is cheaper than the equivalent shared storage
5. A memory reservation will generally provide increased performance for the VM
6. Reduces/Removes the requirement/benefit for a dedicated datastore for vSwap files
7. Transparent page sharing (TPS) will generally only give up to 30-35% memory savings
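
The effect of the reservation on vSwap capacity can be estimated with simple arithmetic. The sketch below assumes each .vswp file is sized as configured memory minus reserved memory; the desktop count and RAM size are hypothetical figures chosen for illustration, not values from this design.

```python
def vswap_storage_gb(desktops: int, ram_gb: float, reservation_pct: float) -> float:
    """Estimate total .vswp capacity, assuming each file is sized as
    configured memory minus reserved memory."""
    if not 0 <= reservation_pct <= 100:
        raise ValueError("reservation_pct must be between 0 and 100")
    per_vm_gb = ram_gb * (100 - reservation_pct) / 100
    return desktops * per_vm_gb


# Hypothetical example: 1,000 desktops with 2 GB RAM each.
for pct in (0, 50):
    print(f"{pct}% reservation: {vswap_storage_gb(1000, 2, pct):.0f} GB of vSwap")
```

With these example figures, the 50% reservation halves the estimated vSwap capacity from 2,000 GB to 1,000 GB.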

Implications

1. Less memory overcommitment will be achieved

Alternatives

1. Set a higher memory reservation of 75% – This would further reduce the shared storage requirement while still allowing for 1.25:1 memory overcommitment
2. Set a 100% memory reservation – This would eliminate the vSwap file but prevent memory overcommitment
3. Set a lower memory reservation of 25% – This would not provide significant storage savings, and as transparent page sharing generally only achieves up to 30-35% savings, there would still be a sizable requirement for vSwap storage with minimal benefit
4. Create a dedicated datastore for vSwap files on lower Tier storage
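
The trade-off between the reservation levels can be sketched numerically. The example below assumes that the sum of memory reservations on a host cannot exceed its physical memory and ignores per-VM overhead memory, so the highest possible overcommitment ratio is 1 divided by the reserved fraction. It also assumes the .vswp file equals configured memory minus the reservation. It is a rough model, not a sizing tool.

```python
def reservation_tradeoff(reservation_pct: float) -> tuple[float, float]:
    """Return (vswap as % of configured RAM, overcommit ceiling).

    Assumes reservations must fit in physical memory and ignores
    per-VM overhead memory, so treat the ratio as an upper bound.
    """
    if not 0 < reservation_pct <= 100:
        raise ValueError("reservation_pct must be in (0, 100]")
    vswap_pct = 100 - reservation_pct
    max_overcommit = 100 / reservation_pct
    return vswap_pct, max_overcommit


for pct in (25, 50, 75, 100):
    vswap_pct, ratio = reservation_tradeoff(pct)
    print(f"{pct:>3}% reserved: vSwap {vswap_pct:.0f}% of RAM, "
          f"overcommit ceiling {ratio:.2f}:1")
```

Under these assumptions, the 1.25:1 ratio mentioned for the 75% option sits within a ceiling of roughly 1.33:1.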


Example Architectural Decision – Storage I/O control for IaaS solutions

Problem Statement

In vSphere clusters servicing IaaS or Cloud workloads where customers or departments have the ability to self provision virtual machines with varying storage I/O requirements, how can the cluster be configured to ensure the most consistent virtual machine performance from a storage perspective?

Assumptions

1. vSphere 5.1 or later (to support both VMFS and NFS datastores and SIOC Automatic latency threshold computation)

Motivation

1. Ensure consistent storage performance for all virtual machines
2. Prevent a single virtual machine from denying other virtual machines reasonable access to storage

Architectural Decision

Enable Storage I/O control for all datastores and leave the shares values at the default setting for all virtual machines.

Set Tier 1 storage congestion threshold to 10ms – e.g. SSD or 15k RPM SAS
Set Tier 2 storage congestion threshold to 20ms – e.g. 15k or 10k RPM SAS
Set Tier 3 storage congestion threshold to 30ms – e.g. 7.2k RPM SATA

Justification

1. In an IaaS or Cloud environment, it is important to prevent intentional or unintentional DoS type attacks; Storage I/O control will prevent such activities by giving equal access to the storage for all virtual machines attempting concurrent access.
2. Ensure no virtual machine (or group of virtual machines) monopolizes the available I/O of the underlying storage, i.e. the noisy neighbor issue
3. Storage I/O control ensures consistent access across all ESXi hosts with access to the datastore, not just a single host. This ensures equal I/O access across the environment, not just across a single ESXi host.
4. Tier 1 should maintain lower latency than lower Tier disk, as such, a lower congestion threshold is advisable to ensure optimal performance for virtual machines hosted on Tier 1
5. Virtual machines requiring significant I/O will not be significantly impacted by Storage I/O control when the congestion threshold is reached: other VMs will still be able to access storage and complete their I/O in a timely manner, and once that I/O has completed they no longer affect performance at all.
6. Virtual machines not accessing storage regularly will not impact the VMs that do, as Storage I/O control only acts on VMs accessing storage concurrently.
7. Leaving VMs with the default share value decreases administrative overhead and prevents human error granting significantly higher (or lower) share values which may negatively impact performance for one or more VMs
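
To illustrate why equal (default) shares give every concurrently active virtual machine the same access once the congestion threshold is exceeded, the sketch below models a proportional-share split. It is a deliberately simplified model and not the actual SIOC algorithm; the VM names, the share value of 1000 and the number of queue slots are assumptions made for illustration.

```python
def allocate_by_shares(shares: dict[str, int], active: set[str],
                       total_slots: int) -> dict[str, float]:
    """Split total_slots among active VMs in proportion to their shares.

    VMs that are not issuing I/O receive nothing and take nothing away
    from the others.
    """
    active_shares = {vm: s for vm, s in shares.items() if vm in active}
    total = sum(active_shares.values())
    if total == 0:
        return {}
    return {vm: total_slots * s / total for vm, s in active_shares.items()}


# Hypothetical datastore: three VMs with equal shares, one of them idle.
shares = {"vm-a": 1000, "vm-b": 1000, "vm-c": 1000}
print(allocate_by_shares(shares, active={"vm-a", "vm-b"}, total_slots=64))
```

In this model the idle VM plays no part in the split, and the two active VMs each receive half of the available slots.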

Implications

1. When using Storage DRS with SIOC, the Storage DRS I/O latency setting needs to be carefully considered. Setting this value below the SIOC congestion threshold (assuming manual latency values are set) is recommended, so that Storage DRS can work towards evenly balancing the storage workload and improving overall performance, while SIOC helps ensure consistent performance by taking action when the congestion threshold is reached to minimize latency spikes.
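
The relationship between the two settings can be checked mechanically. This sketch uses the manual congestion thresholds from the Architectural Decision above; the Storage DRS latency values are hypothetical placeholders, and the function name is illustrative only, not part of any VMware tool.

```python
# Manual SIOC congestion thresholds from this decision, in milliseconds.
SIOC_THRESHOLD_MS = {"tier1": 10, "tier2": 20, "tier3": 30}


def sdrs_below_sioc(sdrs_latency_ms: dict[str, int]) -> dict[str, bool]:
    """Report, per tier, whether the Storage DRS I/O latency setting
    is below the SIOC congestion threshold."""
    return {
        tier: sdrs_latency_ms[tier] < sioc_ms
        for tier, sioc_ms in SIOC_THRESHOLD_MS.items()
        if tier in sdrs_latency_ms
    }


# Hypothetical Storage DRS settings; tier3 is misconfigured on purpose.
print(sdrs_below_sioc({"tier1": 8, "tier2": 15, "tier3": 30}))
```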

Alternatives

1. For vSphere 5.1 environments, use the “Automatic Latency Threshold” by selecting “Percentage of Peak Throughput” and setting the percentage value to “90%”. This setting is designed to minimize the chance of a misaligned congestion threshold being set manually, which could reduce the effectiveness of SIOC
2. Do not enable Storage I/O control
3. Enable Storage I/O control and set higher than default share values on critical VMs

Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage

Problem Statement

With 10Gb connections, the proposed ESXi hosts will have fewer physical connections, but more bandwidth per connection, than a host with 1Gb NICs. In this case, 4 x 10Gb NICs need to cater for all traffic (including IP storage) for the ESXi hosts.

The design needs to ensure all types of traffic have sufficient burst and sustained bandwidth without negatively impacting other types of traffic.

How can this be achieved?

Assumptions

1. No additional network cards (1Gb or 10Gb) can be supported
2. vSphere 5.0 or later
3. 2 x 48 port 10Gb and 2 x 48 port 1Gb switches exist in the environment
4. ESXi hosts are 4-way servers with 512GB RAM which are expected to run large numbers of VMs with varying workloads
5. Multi-NIC vMotion is not required due to using 10Gb NICs

Motivation

1. When using bandwidth allocation, use “shares” instead of “limits,” as the former has greater flexibility for unused capacity redistribution.
2. Ensure IP Storage (NFS) performance is optimal
3. Ensure vMotion activities (including a host entering maintenance mode) can be performed in a timely manner without impact to IP Storage or Fault Tolerance
4. Fault tolerance is a latency-sensitive traffic flow, so it is recommended to always set the corresponding resource-pool shares to a reasonably high relative value in the case of custom shares.

Architectural Decision

Separate VMware infrastructure functions (VMkernel) from virtual machine network traffic by creating two (2) dvSwitches (each with 2 x 10Gb connections), dvSwitch-Admin and dvSwitch-Data.

Enable Network I/O control, and configure NFS and/or iSCSI traffic with a share value of 100, and vMotion and FT with a share value of 25 each.

Configure the two (2) VMkernel ports for IP Storage on dvSwitch-Admin and set them to be Active on one 10Gb interface and Standby on the second.

Configure the VMkernel port for vMotion on dvSwitch-Admin as Active on one interface and Standby on the second, and vice versa for FT.

Configure all dvPortGroups for Virtual Machine data on dvSwitch-Data.

Justification

1. The share values were chosen to ensure storage traffic is not impacted, as this can cause flow-on effects for the environment's performance. vMotion and FT are considered important but, during periods of contention, should not monopolize or impact IP storage traffic.
2. IP Storage is more critical to ongoing cluster and VM performance than vMotion or FT
3. IP storage requires higher priority than vMotion which is more of a burst activity and is not as critical to VM performance
4. With a share value of 25, Fault Tolerance still has ample bandwidth to support the maximum of 4 supported FT machines per host, even during periods of contention
5. With a share value of 25, vMotion still has ample bandwidth to support multiple concurrent vMotions during contention, and day-to-day performance should not be impacted. Up to 8 concurrent vMotions are supported because vMotion is configured on a 10Gb interface (the limit is 4 on a 1Gb interface)
6. The environment required 1Gb switches to accommodate various devices, such as out-of-band management and IP KVM devices; as such, having ESXi management on 2 x 1Gb ports did not add significant cost to the solution
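
The bandwidth each traffic type can expect during contention follows from the share values. The sketch below assumes a worst case in which IP storage, vMotion and FT all compete on a single 10Gb uplink (for example after a NIC failure), and that Network I/O control divides the link strictly in proportion to shares among the traffic types that are active. Real behavior may differ.

```python
def bandwidth_under_contention(shares: dict[str, int],
                               link_gbps: float) -> dict[str, float]:
    """Divide link bandwidth among active traffic types by share value."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one traffic type needs a positive share")
    return {name: link_gbps * s / total for name, s in shares.items()}


# Share values from this decision; all three types saturating one uplink.
shares = {"ip_storage": 100, "vmotion": 25, "ft": 25}
for name, gbps in bandwidth_under_contention(shares, 10.0).items():
    print(f"{name}: {gbps:.2f} Gb/s")
```

Under that assumption IP storage would receive about 6.67 Gb/s, and vMotion and FT about 1.67 Gb/s each. When a traffic type is idle, its portion is available to the others, which is the flexibility that shares offer over limits.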

Implications

1. In the unlikely event of significant and ongoing contention, reduced vMotion and FT performance may affect the ability to evacuate a host in a timely manner. This may impact the ability to perform scheduled maintenance.

Alternatives

1. Use all 4 x 10Gb NICs on a single dvSwitch, and use “Active” and “Standby” to ensure traffic remained on a specified NIC unless there was a failure. Leverage Network I/O control similar to the above example to ensure minimal impact of contention

See Example VMware vNetworking Design for IP Storage for an overview of the vNetworking design described in this example.