Example Architectural Decision – Storage I/O control for IaaS solutions

Problem Statement

In vSphere clusters servicing IaaS or Cloud workloads, where customers or departments can self-provision virtual machines with varying storage I/O requirements, how can the cluster be configured to ensure the most consistent virtual machine performance from a storage perspective?

Assumptions

1. vSphere 5.1 or later (to support both VMFS and NFS datastores and SIOC Automatic latency threshold computation)

Motivation

1. Ensure consistent storage performance for all virtual machines
2. Prevent a single virtual machine from denying other virtual machines reasonable access to storage

Architectural Decision

Enable Storage I/O Control (SIOC) for all datastores and leave the shares values at the default setting for all virtual machines.

Set Tier 1 storage congestion threshold to 10ms – e.g. SSD or 15k RPM SAS
Set Tier 2 storage congestion threshold to 20ms – e.g. 15k or 10k RPM SAS
Set Tier 3 storage congestion threshold to 30ms – e.g. 7.2k RPM SATA
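
The tiered thresholds above could be captured as a small lookup so that any provisioning script applies them consistently. The following is only a minimal Python sketch: the tier keys and the congestion_threshold_ms helper are hypothetical names for illustration and are not part of any VMware API.

```python
# Congestion thresholds in milliseconds per storage tier, as listed above.
# The tier keys and this helper are illustrative assumptions, not a VMware API.
SIOC_THRESHOLD_MS = {
    "tier1": 10,  # e.g. SSD or 15k RPM SAS
    "tier2": 20,  # e.g. 15k or 10k RPM SAS
    "tier3": 30,  # e.g. 7.2k RPM SATA
}


def congestion_threshold_ms(tier: str) -> int:
    """Return the SIOC congestion threshold for a tier; reject unknown tiers."""
    try:
        return SIOC_THRESHOLD_MS[tier.lower()]
    except KeyError:
        raise ValueError(f"unknown storage tier: {tier!r}") from None
```

Failing loudly on an unknown tier is a deliberate choice here, so that a datastore is never left with a silently misapplied threshold.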

Justification

1. In an IaaS or Cloud environment, it is important to prevent intentional or unintentional DoS type attacks; Storage I/O Control mitigates these by giving equal access to the storage for all virtual machines attempting concurrent access.
2. Ensure no virtual machine (or group of virtual machines) monopolizes the available I/O of the underlying storage, e.g. the noisy neighbor issue
3. Storage I/O Control ensures consistent access across all ESXi hosts with access to the datastore, not just a single host. This ensures equal I/O access across the environment, not just across a single ESXi host.
4. Tier 1 should maintain lower latency than lower tier disk, so a lower congestion threshold is advisable to ensure optimal performance for virtual machines hosted on Tier 1
5. Virtual machines requiring significant I/O will not be significantly impacted by Storage I/O Control, even when the congestion threshold is reached. Other VMs requiring access to storage will be able to complete their I/O in a timely manner and, once that I/O is complete, they no longer compete for storage at all.
6. Virtual machines not accessing storage regularly will not impact the VMs accessing storage regularly, as Storage I/O Control only acts on VMs accessing storage concurrently.
7. Leaving VMs with the default share value decreases administrative overhead and prevents human error granting significantly higher (or lower) share values, which may negatively impact performance for one or more VMs
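
To illustrate the reasoning above, the following is a simplified model of share-based allocation during congestion. It is not how SIOC is actually implemented (SIOC works by throttling per-host device queue depths), it is only a sketch of the proportional-share idea. It assumes every VM keeps the default "Normal" disk shares value of 1000, and the VM names and IOPS figures are hypothetical.

```python
def allocate_iops(demand, shares, capacity):
    """Split `capacity` IOPS among VMs in proportion to shares, capped at demand.

    Simplified illustration only: real SIOC throttles host queue depths.
    """
    allocation = {vm: 0.0 for vm in demand}
    active = {vm for vm, wanted in demand.items() if wanted > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(shares[vm] for vm in active)
        # Offer each still-unsatisfied VM its proportional slice of what is left.
        offer = {vm: remaining * shares[vm] / total_shares for vm in active}
        satisfied = {vm for vm in active if demand[vm] - allocation[vm] <= offer[vm]}
        if not satisfied:
            # Everyone wants more than their slice: hand out the slices and stop.
            for vm in active:
                allocation[vm] += offer[vm]
            break
        # VMs needing less than their slice get their full demand; the surplus
        # is redistributed among the remaining VMs on the next pass.
        for vm in satisfied:
            remaining -= demand[vm] - allocation[vm]
            allocation[vm] = float(demand[vm])
        active -= satisfied
    return allocation


# Hypothetical workload on a datastore that can sustain 5000 IOPS.
demand = {"noisy": 9000, "web": 500, "db": 2000, "idle": 0}
shares = {vm: 1000 for vm in demand}  # default shares for every VM
result = allocate_iops(demand, shares, capacity=5000)
```

In this model the idle VM takes nothing, the modest VMs receive everything they ask for, and the noisy VM is limited to what is left over rather than starving its neighbors.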

Implications

1. When using Storage DRS with SIOC, the Storage DRS I/O latency setting needs to be carefully considered. Setting this value below the SIOC congestion threshold (assuming manual latency values are set) is recommended. This allows Storage DRS to work towards evenly balancing the storage workload and improving overall performance, while SIOC helps ensure consistent performance by taking action when the congestion threshold is reached to minimize latency spikes.
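
The relationship between the two latency settings can be expressed as a simple sanity check. This is a hedged sketch only: the function name, the datastore names and the latency values in the example are hypothetical, and nothing here queries a real vCenter.

```python
def misaligned_datastores(settings):
    """Return names of datastores whose Storage DRS I/O latency threshold
    is not below the manually set SIOC congestion threshold.

    `settings` maps a datastore name to a tuple of
    (sdrs_latency_ms, sioc_threshold_ms).
    """
    return sorted(
        name
        for name, (sdrs_ms, sioc_ms) in settings.items()
        if sdrs_ms >= sioc_ms
    )


# Hypothetical values for illustration only.
example = {
    "tier1-ds01": (8, 10),   # Storage DRS acts before SIOC: as recommended
    "tier2-ds01": (20, 20),  # equal thresholds: flagged
    "tier3-ds01": (35, 30),  # Storage DRS above SIOC: flagged
}
flagged = misaligned_datastores(example)
```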

Alternatives

1. For vSphere 5.1 environments, use the “Automatic Latency Threshold” by selecting “Percentage of Peak Throughput” and setting the percentage value to “90%”. This setting is designed to minimize the chance of a misaligned congestion threshold being manually set, which could reduce the effectiveness of SIOC
2. Do not enable Storage I/O Control
3. Enable Storage I/O Control and set higher than default share values on critical VMs

Example VMware vNetworking Design for IP Storage

On a regular basis, I am being asked how to configure vNetworking to support environments using IP Storage (NFS / iSCSI).

The short answer is, as always, it depends on your requirements, but the below is an example of a solution I designed in the past.

Requirements

1. Provide high performance and redundant access to the IP Storage (in this case it was NFS)
2. Ensure ESXi hosts could be evacuated in a timely manner for maintenance
3. Prevent significant impact to storage performance by vMotion / Fault Tolerance and virtual machine traffic
4. Ensure high availability for ESXi Management / VMkernel and virtual machine network traffic

Constraints

1. Four (4) x 10Gb NICs
2. Six (6) x 1Gb NICs (Two onboard NICs and a quad port NIC)

Note: In my opinion the above NICs are hardly “constraining”, but they are still important to mention.

Solution

Use a standard vSwitch (vSwitch0) for the ESXi Management VMkernel. Configure vmNIC0 (Onboard NIC 1) and vmNIC2 (Quad Port NIC – port 1) as its uplinks.

ESXi Management will be Active on vmNIC0 and vmNIC2 although it will only use one path at any given time.

Use a Distributed Virtual Switch (dvSwitch-admin) for IP Storage, vMotion and Fault Tolerance.

Configure vmNIC6 (10Gb Virtual Fabric Adapter NIC 1 Port 1) and vmNIC9 (10Gb Virtual Fabric Adapter NIC 2 Port 2) as its uplinks.

Configure Network I/O Control with NFS traffic having a share value of 100, and vMotion and FT each having a share value of 25.

Each VMkernel port for NFS will be active on one NIC and standby on the other.

vMotion will be Active on vmNIC6 and Standby on vmNIC9 and Fault Tolerance vice versa.

Figure: vNetworking example, dvSwitch-admin

Use a Distributed Virtual Switch (dvSwitch-data) for Virtual Machine traffic.

Configure vmNIC7 (10Gb Virtual Fabric Adapter NIC 1 Port 2) and vmNIC8 (10Gb Virtual Fabric Adapter NIC 2 Port 1) as its uplinks.

Conclusion

While there are many ways to configure vNetworking, and there may be more efficient ways to achieve the requirements set out in this example, I believe the above configuration achieves all the customer requirements.

For example, it provides high performance and redundant access to the IP Storage by using two (2) VMkernel ports, each active on one 10Gb NIC.

IP Storage will not be significantly impacted during periods of contention, as Network I/O Control will ensure that IP Storage traffic receives ~66% of the available bandwidth.
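
The share arithmetic behind that figure can be sketched as follows. This is a minimal Python illustration, assuming that NFS, vMotion and FT are all actively contending on the same 10Gb uplink (shares only take effect under contention); the function name is hypothetical and no other traffic types are modelled.

```python
def nioc_bandwidth_mbps(shares, link_mbps):
    """Divide a link's bandwidth among traffic types in proportion to shares."""
    total = sum(shares.values())
    return {name: link_mbps * value / total for name, value in shares.items()}


# Share values from the solution above, applied to one 10Gb (10,000Mb/sec) NIC.
shares = {"nfs": 100, "vmotion": 25, "ft": 25}
bandwidth = nioc_bandwidth_mbps(shares, link_mbps=10_000)
nfs_fraction = bandwidth["nfs"] / 10_000  # roughly two thirds of the link
```

If fewer traffic types are contending at a given moment, each active type's slice grows accordingly, because the total only counts the shares of traffic that is actually present.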

ESXi hosts can be evacuated in a timely manner for maintenance because:

1. vMotion is active on a 10Gb NIC, thus supporting the maximum of 8 concurrent vMotions
2. In the event of contention, in the worst case scenario vMotion will receive just short of 2Gb/sec of bandwidth (~1,667Mb/sec, being 25 of the 150 configured shares on a 10Gb NIC)

High availability is ensured as each vSwitch and dvSwitch has two (2) uplinks from physically different NICs, which connect to physically separate switches.
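
The NIC redundancy claim can be checked against the uplink layout described in the Solution section. The sketch below models that layout as plain Python data; the Uplink class, the adapter labels and the helper name are hypothetical, and the physical switch cabling is not modelled because the port-to-switch mapping is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    vmnic: str    # uplink name as used in this design
    adapter: str  # physical adapter the port belongs to (illustrative label)


# Uplink assignments taken from the Solution section.
SWITCH_UPLINKS = {
    "vSwitch0": [Uplink("vmNIC0", "onboard"), Uplink("vmNIC2", "quad-port")],
    "dvSwitch-admin": [Uplink("vmNIC6", "vfa-nic1"), Uplink("vmNIC9", "vfa-nic2")],
    "dvSwitch-data": [Uplink("vmNIC7", "vfa-nic1"), Uplink("vmNIC8", "vfa-nic2")],
}


def has_adapter_redundancy(uplinks):
    """True when the uplinks span at least two distinct physical adapters."""
    return len({uplink.adapter for uplink in uplinks}) >= 2


redundancy = {
    name: has_adapter_redundancy(uplinks)
    for name, uplinks in SWITCH_UPLINKS.items()
}
```

Pairing port 1 of one adapter with port 2 of the other on each dvSwitch means the loss of a whole Virtual Fabric Adapter still leaves every switch with one working 10Gb uplink.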

Hopefully you have found this example helpful. For an example Architectural Decision, see Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage.