Example Architectural Decision – Virtual Machine swap file location

Problem Statement

When using shared storage with deduplication enabled alongside an array-level, snapshot-based backup solution, what can be done to minimize the capacity wasted by capturing transient files in backup snapshots, and to minimize the CPU overhead on the storage controller from attempting to deduplicate data which cannot be deduplicated?

Assumptions

1. Virtual machine memory reservations cannot be used to reduce the vswap file size

Motivation

1. Reduce the snapshot size for backups without impacting the ability to back up and restore
2. Minimize the overhead on the storage controller for deduplication processing
3. Optimize the vSphere / Storage solution for maximum performance

Architectural Decision

1. Configure the cluster swap file policy to store the swap file in a datastore specified by the host
2. Create one new datastore per cluster, hosted on Tier 1 storage, with deduplication disabled on that volume
3. Configure all hosts within the cluster to use the same specified datastore for vswap files (a configuration sketch follows this list)
4. Ensure the new datastore is not part of any backup job
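
The cluster-level swap file policy and per-host swap datastore can also be applied programmatically. The following is a minimal pyVmomi (Python) sketch, assuming “cluster” is a vim.ClusterComputeResource and “swap_ds” is the dedicated vim.Datastore, both already retrieved from the inventory via an authenticated ServiceInstance connection; those names, and the omitted error handling, are illustrative only.

from pyVmomi import vim

# Assumes 'cluster' (vim.ClusterComputeResource) and 'swap_ds' (vim.Datastore)
# were already looked up through an authenticated pyVmomi connection.

# Cluster policy: store swap files in the datastore specified by the host
# ("hostLocal") rather than the default "vmDirectory".
spec = vim.cluster.ConfigSpecEx()
spec.vmSwapPlacement = "hostLocal"
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# Point every host in the cluster at the same dedicated swap datastore.
for host in cluster.host:
    host.configManager.datastoreSystem.UpdateLocalSwapDatastore(datastore=swap_ds)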

Justification

1. With minimal added complexity, backup jobs now exclude the VM swap files, which reduces the backup size by the total amount of vRAM assigned to the VMs within the environment
2. As the vswap file is recreated at VM power-on, losing this file has no consequence
3. Tier 1 storage requirements for backup snapshots are decreased
4. The storage controller will not waste CPU cycles attempting to deduplicate data which will not deduplicate
5. Setting high percentages of memory reservation would reduce the ability to overcommit memory in the environment, whereas specifying a datastore for vswap reduces overhead without any significant downside

Implications

1. A datastore will need to be created for swap files
2. The cluster swap file policy will need to be set to store the swap file in a datastore specified by the host
3. The hosts (via Host Profiles) will need to be configured to use a specified datastore for vswap
4. vMotion performance will not be impacted, as all hosts within the cluster share a common vswap datastore, so the swap file never needs to be copied between datastores during a vMotion
5. The datastore will need to be sized to account for the total vRAM assigned to VMs within the cluster (see the sizing sketch after this list)
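
As a rough sizing aid for implication 5, the sketch below sums worst-case vswap consumption across a cluster; a VM's swap file equals its configured vRAM minus its memory reservation. It assumes the same pre-fetched “cluster” object as above, and the 20% headroom factor is an arbitrary illustrative assumption, not a recommendation.

def required_swap_capacity_gb(cluster, headroom=1.2):
    """Worst-case vswap datastore size (GB) for a vSphere cluster.

    Each VM's swap file is sized at configured vRAM minus its memory
    reservation; 'headroom' is an illustrative growth buffer.
    """
    total_mb = 0
    for host in cluster.host:
        for vm in host.vm:
            if vm.config is None:  # skip inaccessible or orphaned VMs
                continue
            mem_mb = vm.config.hardware.memoryMB
            resv_mb = vm.config.memoryAllocation.reservation or 0
            total_mb += max(mem_mb - resv_mb, 0)
    return total_mb / 1024 * headroom

Given assumption 1 (no memory reservations), this reduces to the total vRAM assigned to VMs in the cluster, plus headroom.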

Alternatives

1. Set virtual machine memory reservations to 100% to eliminate the vswap file
2. Store the swap file in the same directory as the virtual machine and accept the overhead on backups and deduplication
3. Use multiple datastores for vswap across the cluster and accept the impact on vMotion

 

Example Architectural Decision – DRS Automation Level

Problem Statement

What is the most suitable DRS automation level and migration threshold for a vSphere cluster running an IaaS offering with a self-service portal and unpredictable workloads?

Assumptions

1. Workload types and sizes are unpredictable in an IaaS environment; workloads may vary greatly and without notice
2. The solution needs to be as automated as possible without introducing significant risk

Motivation

1. Prevent unnecessary vMotion migrations which will impact host and cluster performance
2. Ensure the cluster load standard deviation is minimal
3. Reduce the administrative overhead of reviewing and approving DRS recommendations

Alternatives

1. Use Fully automated and Migration threshold 1 – Apply priority 1 recommendations
2. Use Fully automated and Migration threshold 2 – Apply priority 1 & 2 recommendations
3. Use Fully automated and Migration threshold 4 – Apply priority 1, 2, 3 and 4 recommendations
4. Use Fully automated and Migration threshold 5 – Apply priority 1, 2, 3, 4 & 5 recommendations
5. Set DRS to manual and have a VMware administrator assess and apply recommendations

Justification

1. Prevent excessive vMotion migrations that do not provide significant benefit to cluster balance as the vMotion itself will use cluster and network resources
2. Ensure the cluster remains in a reasonably load-balanced state without resources being wasted on load balancing for minimal improvement
3. DRS is a low risk, proven technology which has been used in large production environments for many years
4. Setting DRS to manual would be a significant administrative overhead and introduce additional risk from human error
5. Setting a more aggressive DRS migration threshold would put an additional load on the cluster which will likely not result in significantly better balance

Architectural Decision

Use DRS in Fully Automated mode with Migration threshold “3” – Apply priority 1, 2 and 3 recommendations
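
This decision maps directly to the cluster's DRS configuration in the vSphere API. A minimal pyVmomi sketch, again assuming “cluster” is a vim.ClusterComputeResource already retrieved through an authenticated connection:

from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx()
spec.drsConfig = vim.cluster.DrsConfigInfo(
    enabled=True,
    defaultVmBehavior="fullyAutomated",
    vmotionRate=3,  # middle migration threshold: apply priority 1-3 recommendations
)
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

The same vmotionRate field can be temporarily adjusted to the most aggressive threshold after host maintenance (see implication 2 below) and reverted once the cluster rebalances.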

Implications

1. DRS will not move workloads via vMotion where only a moderate improvement to the cluster will be achieved
2. At times, such as after performing ESXi host updates via VMware Update Manager (VUM), the cluster may appear unevenly balanced, as DRS may calculate minimal benefit from migrations. Temporarily setting DRS to “Fully automated and Migration threshold 5” following maintenance should result in a more evenly balanced cluster.

Example Architectural Decision – High Availability Admission Control Setting and Policy

Problem Statement

In a self-service IaaS cloud, the virtual machine numbers and compute requirements will likely vary significantly and without notice. The environment needs to achieve the maximum consolidation ratio possible without impacting the ability to provide redundancy at a minimum of:

1. N+1 for clusters of up to 8 hosts
2. N+2 for clusters of >8 hosts but <=16
3. N+3 for clusters of >16 hosts but <=24
4. N+4 for clusters of >24 hosts but <=32

What is the most efficient HA admission control policy / setting and configuration for the vSphere cluster?

Assumptions

1. Virtual machine workloads will vary from small, e.g. 1 vCPU / 1GB RAM, up to large VMs of >=8 vCPU and >=64GB RAM
2. Redundancy is mandatory as per the problem statement
3. ESXi hosts can support the maximum VM size required by the offering
4. vSphere 5.0 or later is being used

Motivation

1. Ensure maximum consolidation ratios in the cluster
2. Ensure optimal compute resource utilization
3. Prevent HA overhead from being increased by the potentially inefficient slot-size-based HA algorithms
4. Make maximum use of hardware investment

Alternatives

1. Use “Specify a failover host”
2. Set “Host failures cluster tolerates” to 1, 2, 3 or 4 depending on cluster size

Justification

1. Enabling admission control is critical to ensure the required level of availability
2. The admission control policies that rely on slot-size-based HA algorithms do not suit clusters with varying VM sizes
3. The percentage setting being rounded up adds minimal additional HA overhead and helps ensure performance in an HA event
4. Ensure maximum CPU scheduling efficiency by having all hosts within the cluster running virtual machines
5. Ensure optimal DRS flexibility by having all hosts within the cluster active to be able to run virtual machines

Architectural Decision

For the HA admission control setting, use “Enable – Do not power on virtual machines that violate availability constraints”

For the HA admission control policy, use “Percentage of cluster resources reserved for HA” and set the percentage of cluster resources as per the table below.

Note: Percentage values that do not equate to a whole number will be rounded up.

Cluster size    Redundancy    Percentage of cluster resources reserved
2 hosts         N+1           50%
4 hosts         N+1           25%
8 hosts         N+1           13%
12 hosts        N+2           17%
16 hosts        N+2           13%
24 hosts        N+3           13%
32 hosts        N+4           13%

Note: Check out this cool HA admission control percentage calculator by Samir Roshan of ThinkingLoudOnCloud
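
The percentages follow deterministically from the redundancy tiers in the problem statement: reserve the failover capacity divided by the cluster size, rounded up. A small pure-Python sketch of that calculation:

import math

def ha_reserved_percentage(num_hosts):
    """HA admission control percentage for a cluster of num_hosts.

    Redundancy tiers per the problem statement: N+1 up to 8 hosts,
    N+2 up to 16, N+3 up to 24, N+4 up to 32. Fractional results
    are rounded up.
    """
    if not 1 <= num_hosts <= 32:
        raise ValueError("vSphere 5.x clusters support 1 to 32 hosts")
    failover_hosts = math.ceil(num_hosts / 8)  # 1-8 -> 1, 9-16 -> 2, ...
    return math.ceil(failover_hosts / num_hosts * 100)

For example, ha_reserved_percentage(8) returns 13 and ha_reserved_percentage(12) returns 17, matching the table above.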

Implications

1. The “Percentage of cluster resources reserved for HA” policy uses VM-level CPU and memory reservations to calculate cluster capacity. If no reservations are set, performance in the event of a failure may be impacted
2. The default CPU capacity reserved for HA is 32MHz per VM – CPU reservations should be considered for critical VMs to ensure performance is not significantly degraded in an HA event
3. The default memory reserved for HA is 0MB + VM memory overhead – memory reservations should be considered for critical VMs to ensure performance is not significantly degraded in an HA event
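
For completeness, the chosen setting and policy correspond to the cluster's DasConfigInfo in the vSphere API. A minimal pyVmomi sketch, assuming “cluster” is the target vim.ClusterComputeResource and the percentage comes from a calculation like the one above:

from pyVmomi import vim

pct = 13  # e.g. an 8-host cluster at N+1, per the table above

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo(
    enabled=True,
    # "Do not power on virtual machines that violate availability constraints"
    admissionControlEnabled=True,
    admissionControlPolicy=vim.cluster.FailoverResourcesAdmissionControlPolicy(
        cpuFailoverResourcesPercent=pct,
        memoryFailoverResourcesPercent=pct,
    ),
)
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)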