High Availability Admission Control Setting and Policy
Problem Statement
In a self-service IaaS cloud, the number of virtual machines and their compute requirements will likely vary significantly and without notice. The environment needs to achieve the maximum consolidation ratio possible without impacting the ability to provide redundancy at a minimum of:
1. N+1 for clusters of up to 8 hosts
2. N+2 for clusters of >8 hosts but <=16
3. N+3 for clusters of >16 hosts but <=24
4. N+4 for clusters of >24 hosts but <=32
What is the most efficient HA admission control policy / setting and configuration for the vSphere cluster?
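To make the redundancy tiers above concrete, here is a minimal sketch (Python, with a hypothetical helper name) that maps a cluster size to the minimum number of host failures the cluster must be able to tolerate:

```python
import math

def required_failover_hosts(cluster_hosts: int) -> int:
    """Minimum host failures to tolerate per the tiers above:
    N+1 for up to 8 hosts, N+2 for 9-16, N+3 for 17-24, N+4 for 25-32."""
    if not 2 <= cluster_hosts <= 32:
        raise ValueError("expected a cluster of 2 to 32 hosts")
    # The tiers add one failover host for every (full or partial) block of 8 hosts
    return math.ceil(cluster_hosts / 8)

# A 12 host cluster falls in the >8 but <=16 tier, so N+2
print(required_failover_hosts(12))  # 2
```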
Assumptions
1. Virtual machine workloads will vary from small (e.g. 1 vCPU / 1GB RAM) up to large VMs of >=8 vCPU and >=64GB RAM
2. Redundancy is mandatory as per the problem statement
3. ESXi hosts can support the maximum VM size required by the offering
4. vSphere 5.0 or later is being used
Motivation
1. Ensure maximum consolidation ratios in the cluster
2. Ensure optimal compute resource utilization
3. Prevent HA overhead from being increased by the potentially inefficient slot-size-based HA algorithm
4. Make maximum use of hardware investment
Alternatives
1. Use “Specify a failover host”
2. Set “Host failures the cluster tolerates” to 1, 2, 3 or 4 depending on cluster size
Justification
1. Enabling admission control is critical to ensure the required level of availability
2. The admission control settings that rely on the slot-size-based HA algorithms do not suit clusters with varying VM sizes
3. The percentage setting being rounded up adds minimal additional HA overhead and helps ensure performance in an HA event
4. Ensure maximum CPU scheduling efficiency by having all hosts within the cluster running virtual machines
5. Ensure optimal DRS flexibility by having all hosts within the cluster active to be able to run virtual machines
Architectural Decision
For the HA Admission control setting use “Enable – Do not power on virtual machines that violate availability constraints”
For the HA admission control policy use “Percentage of cluster resources reserved for HA” and set the percentage of cluster resources as per the below table.
Note: Percentage values that do not equate to a whole number will be rounded up.
Note: Check out this cool HA admission control percentage calculator by Samir Roshan of ThinkingLoudOnCloud
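For those who prefer to see how the table values are derived, here is a minimal sketch (Python, hypothetical function name) of the calculation, assuming all hosts in the cluster are equally sized and applying the round-up rule from the note above:

```python
import math

def ha_reservation_percentage(cluster_hosts: int, failover_hosts: int) -> int:
    """Percentage of cluster resources to reserve so the capacity of
    `failover_hosts` hosts is always kept free (equally sized hosts assumed)."""
    if failover_hosts >= cluster_hosts:
        raise ValueError("failover hosts must be fewer than cluster hosts")
    # Round any fractional percentage up, per the note above
    return math.ceil(failover_hosts / cluster_hosts * 100)

# Examples across the redundancy tiers from the problem statement
for hosts, n in [(8, 1), (12, 2), (16, 2), (24, 3), (32, 4)]:
    print(f"{hosts} hosts at N+{n}: {ha_reservation_percentage(hosts, n)}%")
# 8 hosts at N+1: 13%, 12 hosts at N+2: 17%, 16 hosts at N+2: 13%, ...
```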
Implications
1. The Percentage of cluster resources reserved for HA uses VM-level CPU and memory reservations to calculate cluster capacity (a rough sketch of this calculation follows this list). If no reservations are set, performance in the event of a failure may be impacted
2. The default CPU reserved for HA is 32MHz per VM, so CPU reservations should be considered for critical VMs to ensure performance is not significantly degraded in an HA event
3. The default memory reserved for HA is 0MB + VM overhead, so memory reservations should be considered for critical VMs to ensure performance is not significantly degraded in an HA event
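To illustrate why reservations matter to this policy, below is a rough sketch (Python, with made-up cluster and VM figures) of how the percentage-based policy derives its current failover capacity for a single resource, counting unreserved VMs at the defaults noted above:

```python
def current_failover_capacity_pct(total_capacity: float, vm_reservations: list[float],
                                  default_per_vm: float = 32.0) -> float:
    """Approximate current failover capacity for one resource (e.g. CPU in MHz).
    VMs with no reservation are counted at the default (32MHz for CPU;
    for memory the default is 0MB plus the VM's memory overhead)."""
    reserved = sum(r if r > 0 else default_per_vm for r in vm_reservations)
    return (total_capacity - reserved) / total_capacity * 100

# Hypothetical cluster with 40,000MHz of usable CPU, 50 unreserved VMs and
# 2 critical VMs with 2,000MHz reservations each
reservations = [0.0] * 50 + [2000.0, 2000.0]
print(round(current_failover_capacity_pct(40_000, reservations), 1))  # ~86.0
# With few reservations set, the reported failover capacity looks ample even if
# actual demand is much higher - hence reservations for critical VMs
```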
Speaking of setting reservations for critical machines, a large customer asked me about setting these reservations so that their failover capacity was based on their actual usage. Managing these reservations in huge environments adds a fair amount of administrative overhead, so I asked Frank Denneman about it, which prompted him to write this post if you or other people are interested: http://frankdenneman.nl/2012/10/17/ha-admission-control-is-not-a-capacity-management-tool/