Example Architectural Decision – vSphere configuration for handling APD/PDL scenarios

Problem Statement

What is the best way to configure the vSphere environment to handle All Paths Down (APD) and Permanent Device Loss (PDL) situations where the environment uses Active/Active (IBM SVC) storage with FC connectivity via a dedicated highly available Storage Area Network (SAN) fabric?

Requirements

1. Ensure that in the event of storage issues the impact to the vSphere environment is minimized.
2. Where possible, have the environment respond automatically to storage problems.

Assumptions

1. vSphere 5.1 or later
2. The Storage Area Network (SAN) fabric is highly available (>99.999% availability)
3. All storage is FC (block) based via an Active/Active Disk array (IBM SVC disk system)
4. All ESXi hosts have storage connectivity via multiple HBAs
5. All ESXi hosts are connected to two (2) physically separate FC switches
6. The Path Selection Plugin (PSP) being used is “VMW_PSP_RR” (Round Robin)

Constraints

1. None

Motivation

1. Minimize impact of APD and PDL situations

Architectural Decision

Configure the following advanced settings

Set “Misc.APDHandlingEnable” to 1 (0 is default which is Disabled)
Set “Misc.APDTimeout” to 20 (140 seconds is default)

Set “disk.terminateVMOnPDLDefault” to 1 (Enabled)
Set “das.maskCleanShutdownEnabled” to 1 (Enabled)
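
For reference, the two host-level settings could be applied programmatically. The sketch below uses pyVmomi (the Python SDK for the vSphere API) and is an illustrative assumption rather than part of the decision: the vCenter address, credentials and cluster name are placeholders, and the exact OptionManager call and value typing may vary slightly between pyVmomi releases.

```python
# Minimal pyVmomi sketch: apply the APD-related advanced settings to every
# ESXi host in a named cluster. All connection details are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def set_apd_settings(si, cluster_name):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == cluster_name)
    view.DestroyView()
    for host in cluster.host:
        opt_mgr = host.configManager.advancedOption  # vim.option.OptionManager
        # Both options are numeric on the host; some pyVmomi versions may need
        # an explicitly typed (long) value to serialise correctly.
        opt_mgr.UpdateOptions(changedValue=[
            vim.option.OptionValue(key="Misc.APDHandlingEnable", value=1),  # enable APD handling
            vim.option.OptionValue(key="Misc.APDTimeout", value=20),        # lower timeout per the decision
        ])

if __name__ == "__main__":
    ctx = ssl._create_unverified_context()  # lab use only; validate certificates in production
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        set_apd_settings(si, "Cluster01")
    finally:
        Disconnect(si)
```

Note that the remaining two settings live at different layers: “disk.terminateVMOnPDLDefault” is configured on each host (via /etc/vmware/settings on vSphere 5.1) and “das.maskCleanShutdownEnabled” is an HA advanced option set on the cluster.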

Justification

1. The storage array (IBM SVC) operates in an Active/Active manner and the Path Selection Plugin (PSP) in use is either “VMW_PSP_RR” (Round Robin), “VMW_PSP_MRU” (Most Recently Used) or “VMW_PSP_FIXED_AP” (Note: now included in VMW_PSP_FIXED in vSphere 5.1), so in the event of one or more path failures the PSP will handle the failover and use a working path. Where an APD situation occurs in a highly available SAN fabric, the issue is likely a catastrophic failure and it is preferable to terminate I/O as soon as possible. As such, lowering “Misc.APDTimeout” to 20 (the minimum) allows for a short outage but does not allow the VMs to continue attempting I/O that cannot be committed to disk.

2. After 20 seconds, any I/O from the VMs will be “fast-failed” with a status of “No_Connect”, preventing “hostd” worker threads from being exhausted and the “hostd” service from hanging, thus increasing resiliency at the ESXi layer.

3. In the event not all hosts in the cluster are impacted by the PDL, HA can detect the PDL on one (or more) hosts and restart the affected virtual machines on a host in the cluster that does not have the PDL state on the datastore(s).

4. Having “disk.terminateVMOnPDLDefault” enabled ensures VMs are shut down in a PDL event.

5. The “das.maskCleanShutdownEnabled” setting allows VMs shut down as a result of a PDL to be automatically restarted by HA.

6. Setting “Misc.APDTimeout” to 20 does not impact storage connectivity even in the event of a single SVC cluster node failing, as all storage is active on all SVC cluster nodes. Note: Half the paths would be lost in the event of a failed SVC cluster node, but this does not constitute an APD situation.

Alternatives

1. Leave “Misc.APDHandlingEnable” at 0 (default)
2. Leave “Misc.APDTimeout” at 140 (default) OR set a higher or lower value (20 Min / 99999 Max)
3. Set “das.maskCleanShutdownEnabled” to Disabled
4. Set “disk.terminateVMOnPDLDefault” to 0 (Disabled)
5. Various combinations of the above

Implications

1. After 20 seconds, any I/O from the VMs will be “fast-failed” with a status of “No_Connect”. In the unlikely event of an outage lasting >20 seconds, manual intervention will be required.
2. In the event of an APD situation, virtual machines will not be restarted by HA even where other ESXi hosts are not impacted by the APD situation.
3. Due to the nature of an APD situation, there is no clean way to recover. Once the issue is resolved at the SAN fabric or disk system layer, ESXi hosts may need to be rebooted.

Related Articles

1. Advanced Configuration options for VMware High Availability in vSphere 5.0 and 5.1 (2033250)


Example Architectural Decision – VMware HA – Percentage of Cluster resources reserved for HA

Problem Statement

The decision has been made to use the “Percentage of cluster resources reserved for HA” admission control setting, and to use strict admission control to ensure the N+1 minimum redundancy level is maintained. However, as most virtual machines do not use “Reservations” for CPU and/or Memory, the default reservation is only 32MHz and 0MB + overhead per virtual machine. In the event of a failure, this level of resources is unlikely to provide sufficient compute to operate production workloads. How can the environment be configured to ensure a minimum level of performance is guaranteed in the event of one or more host failures?

Requirements

1. All Clusters have a minimum requirement of N+1 redundancy
2. In the event of a host failure, a minimum level of performance must be guaranteed

Assumptions

1. vSphere 5.0 or later (Note: This is significant as the default CPU reservation dropped from 256MHz to 32MHz; RAM remained at 0MB + overhead)

2. Percentage of Cluster resources reserved for HA is used and set to a value as per Example Architectural Decision – High Availability Admission Control

3. Strict admission control is enabled

4. Target overcommitment ratios are <=4:1 vCPU to physical core and <=1.5:1 vRAM to physical RAM

5. Physical CPU Core speed is >=2.0Ghz

6. Virtual machines sizes in the cluster will vary

7. A limited number of mission critical virtual machines may be set with reservations

8. The average VM uses >2GB RAM

9. Cluster compute resources will be utilized at >=50%

Constraints

1. All compute requirements must be provided to virtual machines during business-as-usual (BAU) operations

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity
3. Ensure the target availability and performance is maintained without significantly compromising overcommitment ratios

Architectural Decision

Ensure all clusters remain configured with the HA admission control setting
“Enable – Do not power on virtual machines that violate availability constraints”

and

Use “Percentage of Cluster resources reserved for HA” for the admission control policy, with the percentage value based on Example Architectural Decision – High Availability Admission Control

Configure the following HA Advanced Settings

1. “das.vmMemoryMinMB” with a value of “1024”
2. “das.vmCpuMinMHz” with a value of “512”
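
As an illustration only (not part of the original decision), the sketch below shows how these HA settings could be applied with pyVmomi to an existing cluster object. The percentage values are placeholders to be taken from the referenced admission control decision, and exact property names should be validated against the pyVmomi version in use.

```python
# Minimal pyVmomi sketch: enable admission control with the percentage-based
# policy and set the per-VM HA minimums on a vim.ClusterComputeResource.
from pyVmomi import vim

def configure_ha_minimums(cluster, cpu_pct, mem_pct):
    das = vim.cluster.DasConfigInfo()
    # "Do not power on virtual machines that violate availability constraints"
    das.admissionControlEnabled = True
    das.admissionControlPolicy = vim.cluster.FailoverResourcesAdmissionControlPolicy(
        cpuFailoverResourcesPercent=cpu_pct,
        memoryFailoverResourcesPercent=mem_pct)
    # HA advanced options are free-form key/value strings
    das.option = [
        vim.option.OptionValue(key="das.vmMemoryMinMB", value="1024"),
        vim.option.OptionValue(key="das.vmCpuMinMHz", value="512"),
    ]
    spec = vim.cluster.ConfigSpecEx(dasConfig=das)
    # modify=True merges this spec with the cluster's existing configuration
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```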

Justification

1. Enabling admission control is critical to ensure the required level of availability.
2. The “Percentage of cluster resources reserved for HA” setting allows a suitable percentage of cluster resources to be reserved, depending on the size of each cluster, to maintain N+1
3. The potentially inefficient slot size calculation used with “Host failures cluster tolerates” does not suit clusters where virtual machine sizes vary and/or where some mission critical VMs require reservations

4. Using the advanced settings “das.vmCpuMinMHz” and “das.vmMemoryMinMB” allows a minimum level of performance (per VM) to be guaranteed in the event of one or more host failures

5. The advanced settings have been configured to ensure the target overcommitment ratios are still achieved while ensuring a minimum level of resources in the event of a host failure

6. This maintains an acceptable minimum level of performance in the event of a host failure without requiring the administrative overhead of setting and maintaining “reservations” at the virtual machine level

7. Where no reservations are used and the advanced settings are not configured, the default reservation of 32MHz and 0MB + memory overhead would be used, which would likely result in degraded performance in the event a host failure occurs (see the worked example below)
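
The effect of the two minimums on HA's admission control arithmetic can be shown with simple figures. The VM count and per-VM memory overhead below are assumptions for illustration only:

```python
# Back-of-the-envelope comparison of the capacity HA reserves for 100 VMs
# under the percentage policy, with and without the das.* minimums.
vm_count = 100
overhead_mb = 100  # assumed average per-VM memory overhead

default_mhz, default_mb = vm_count * 32, vm_count * (0 + overhead_mb)
minimum_mhz, minimum_mb = vm_count * 512, vm_count * (1024 + overhead_mb)

print("Defaults (32MHz / 0MB):       %6d MHz, %6d MB" % (default_mhz, default_mb))
print("With das minimums (512/1024): %6d MHz, %6d MB" % (minimum_mhz, minimum_mb))
# Defaults reserve only 3,200 MHz and 10,000 MB for 100 VMs, so almost nothing
# is actually guaranteed after a failure; the minimums reserve 51,200 MHz and
# 112,400 MB, providing a meaningful per-VM performance floor.
```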

Alternatives

1. Use “Specify a failover host” and have one or more hosts specified
2. Use “Host failures cluster tolerates” and set it to an appropriate value depending on the number of hosts per cluster, without using advanced settings
3. Use higher percentage values
4. Use higher or lower values for “das.vmMemoryMinMB” and “das.vmCpuMinMHz”
5. Set Virtual machine level reservations on all VMs

Implications

1. The “das.vmCpuMinMHz” advanced setting applies on a per-VM basis, not a per-vCPU basis, so VMs with multiple vCPUs will still only be guaranteed 512MHz in an HA event

2. This will reduce the number of virtual machines that can be powered on within the cluster (in order to enforce the HA requirements)


Example Architectural Decision – Number of paths per LUN for VMFS datastores

Problem Statement

In a vSphere environment hosting a large number of VMs, virtual machine I/O requirements range from small (<100 IOPS) to large business critical applications requiring tens of thousands of IOPS. The ESXi hosts have been configured with 4 x 8Gb FC HBA ports (two dual-port HBAs).

What is the most suitable number of paths per LUN when using 4 x 8Gb FC connections per host, and how will they be presented in a highly available manner with two (2) SAN fabrics connected to an Active/Active enterprise disk array?

Requirements

1. All LUNs are available on all FC interfaces
2. The storage must be highly available
3. The environment should be able to continue running production workloads in the unlikely event of a dual-port HBA or single fabric failure
4. The environment must maintain a consistent level of performance

Assumptions

1. The Storage area network has two (2) fabrics each of which is highly available
2. The disk system is presented to both SAN fabrics
3. The number of VMs per host is >100
4. vSphere 4.0 or later
5. Storage array is Active/Active
6. ESXi hosts are large and are designed to drive significant I/O
7. VAAI is supported and enabled

Constraints

1. Maximum paths supported per ESXi host is 1024
2. Maximum number of datastores per ESXi host is 256

Motivation

1. Ensure optimal performance and redundancy
2. Maximize the total capacity able to be presented to a cluster

Architectural Decision

Use a standard of 8 paths per LUN

Each LUN will be presented to each HBA via both Controller A and Controller B resulting in two paths per LUN per HBA.

With a total of 4 FC connections across two (2) physical dual-port HBAs in an HA configuration with one (1) connection per HBA per fabric, this equates to a total of 8 paths per LUN to the ESXi host (4 paths per fabric).
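
The path and LUN arithmetic behind this standard can be sanity-checked with a trivial calculation using only the figures stated in this decision:

```python
# Path arithmetic for the 8-paths-per-LUN standard.
hba_ports = 4                # 2 x dual-port HBAs, one port per HBA per fabric
paths_per_port_per_lun = 2   # each LUN presented via Controller A and Controller B
paths_per_lun = hba_ports * paths_per_port_per_lun    # 8 paths per LUN (4 per fabric)

max_paths_per_host = 1024    # vSphere constraint listed above
max_luns = max_paths_per_host // paths_per_lun        # 128 LUNs within the path limit

print(paths_per_lun, max_luns)  # -> 8 128
```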

Justification

1. This equates to 4 paths (1 per HBA interface per LUN) per Fabric
2. VMware NMP with “Round Robin” will be used, and having all LUNs presented via both fabrics and all HBAs will provide the maximum reduction in latency and the most consistent performance overall
3. 8 paths per LUN ensures up to 128 LUNs can be presented within the 1024 paths per ESXi host limit which will support sufficient capacity for the cluster
4. The solution is highly available as it uses two fabrics and both controllers are Active
5. In the event of a Fabric failure, the remaining Fabric serving 2 x 8Gb connections will provide connectivity to both Controller A and B, with a total of 4 paths
6. Ensures the cluster can have enough LUNs to balance workloads across, which will assist in keeping latency to a minimum

Alternatives

1. Have fewer paths per LUN, which enables the use of more LUNs
2. Have more paths per LUN and fewer LUNs

Implications

1. LUNs will need to be sized to ensure that a maximum of 128 LUNs is sufficient from a capacity perspective to cater for the desired number of virtual machines
