Example Architectural Decision – Supporting VMware View Infrastructure Servers

Problem Statement

When designing a VMware View environment, numerous management virtual machines are required to run the environment, including but not limited to Domain Controllers, vCenter, VUM, View Connection Brokers, View Security Servers, View Transfer Servers and View Composer. These servers are typically heavily utilized in larger View deployments, and compute or storage contention on them would likely impact the performance of the Virtual Desktop Infrastructure, especially where View Composer or virtual desktop power or provisioning operations are frequent.

How can the VDI environment be designed so that management servers have a consistently high level of performance, while still achieving high consolidation ratios for desktops and maintaining a consistent end user experience?

Assumptions

1.  One or more VMware View “Blocks”
2. ~2000 Users per Block
3. Using VMware View Linked Clones
4. Target overcommitment for Virtual desktops vCPU is >=6:1 – This is a conservative overcommitment ratio, >10:1 can be achieved
5. Target overcommitment for Virtual desktops vRAM is >=1.5:1 – This is a reasonable overcommitment ratio,  although higher can be achieved
6. vSphere 4.1 or later
7. VMware View 4.5 or later
8. ESXi Hosts are large enough to support >200 users each (e.g. at least 2 way / 256GB, assuming 1vCPU/1GB RAM VDI VMs)
9. An existing vSphere cluster supporting server workloads is not available or is at or near capacity
10. Antivirus has been optimized for Virtual desktop environments, such as vShield Endpoint to offload AV scanning to the hypervisor
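
The sizing implied by assumptions 2, 4, 5 and 8 can be sketched as a small calculation. This is a rough illustration only: the physical core count per host (40) is a hypothetical value that does not come from the assumptions above, and hypervisor memory overhead and HA spare capacity are ignored.

```python
import math

def hosts_per_block(users, vcpu_per_vm, ram_gb_per_vm,
                    cores_per_host, ram_gb_per_host,
                    cpu_overcommit, ram_overcommit):
    """Return (desktops_per_host, hosts_needed), excluding HA spare capacity."""
    # Desktops per host when CPU is the limit: physical cores x vCPU overcommit.
    by_cpu = math.floor(cores_per_host * cpu_overcommit / vcpu_per_vm)
    # Desktops per host when RAM is the limit: physical RAM x vRAM overcommit.
    by_ram = math.floor(ram_gb_per_host * ram_overcommit / ram_gb_per_vm)
    per_host = min(by_cpu, by_ram)
    return per_host, math.ceil(users / per_host)

# 2000 users, 1 vCPU / 1GB desktops, 256GB hosts, 6:1 vCPU and 1.5:1 vRAM.
# cores_per_host=40 is an assumed figure used only for illustration.
per_host, hosts = hosts_per_block(2000, 1, 1, 40, 256, 6, 1.5)
print(per_host, hosts)  # 240 desktops per host (CPU bound), 9 hosts
```

With these inputs CPU, not memory, is the limiting resource, which is why adding multi-vCPU management VMs to the same hosts would eat directly into the desktop consolidation ratio.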

Motivation

1.  Ensure consistent & optimal performance for Virtual desktops and VMware View Infrastructure VMs
2. Achieve the best ROI for the solution

Architectural Decision

Create a three (3) node “Management Cluster” with a scale out approach using 2 Way servers (as opposed to Four way servers like the VMware View Blocks) to ensure lower HA overhead (33% for N+1) and higher DRS efficiency than a two (2) node cluster. Place management virtual machines on different underlying storage from the virtual desktops, being either dedicated RAID packs or aggregates or, for large environments, dedicated storage controllers. Have a vCenter dedicated to running the Management infrastructure.
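
The 33% figure follows from reserving one host's worth of capacity in an N node cluster. A minimal sketch, assuming identically sized hosts (the function name is illustrative only):

```python
def n_plus_one_overhead(nodes):
    """Fraction of cluster capacity reserved to tolerate one host failure."""
    if nodes < 2:
        raise ValueError("N+1 needs at least two hosts")
    return 1 / nodes

for nodes in (2, 3, 4):
    print(f"{nodes} nodes: {n_plus_one_overhead(nodes):.0%} reserved for HA")
# 2 nodes: 50% reserved for HA
# 3 nodes: 33% reserved for HA
# 4 nodes: 25% reserved for HA
```

A two node cluster gives up half its capacity to HA, which is why three smaller hosts are preferred here over two larger ones.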

Justification

1.  The CPU overcommitment ratio for Virtual desktops is generally much higher than for server workloads
2. Server workloads are less tolerant to high CPU overcommitment ratios than virtual desktops
3. CPU contention (a.k.a. CPU Ready) will likely have significant impact on infrastructure VMs
4. If Management VMs were hosted within the VMware View Blocks, the overcommitment would have to be lower to enable adequate performance, thus reducing the ROI for the solution
5. Server and desktop workloads have very different compute and storage profiles and generally are not good candidates to share the same ESXi host or cluster
6. During VMware View Linked Clone deployments, or maintenance activities such as a “recompose” of one or more Pools, Management VMs such as vCenter and View Composer should have minimal or no compute contention to ensure timely completion of maintenance. This does not fit well in a cluster with >6:1 CPU overcommitment.
7. Having a management cluster minimizes or removes the requirement for complexity/overheads of setting CPU or Memory reservations in an attempt to ensure performance for management VMs competing for compute resources with virtual desktops. (See “Common Mistake – Using CPU reservations to solve CPU ready” for more information)
8. Maximize the efficiency of the CPU scheduler, as the majority of Virtual Desktops should be 1vCPU as compared to management VMs such as vCenter / SQL / Connection brokers which will likely be 2 and 4 vCPU. Scheduling VMs with higher vCPU numbers on an environment with >6:1 vCPU overcommitment is unlikely to result in acceptable performance for the management virtual machines.
9. Having one or more clusters dedicated to desktops gives more flexibility to use features such as Distributed Power Management (DPM) for VMware View Blocks, which will help achieve a faster ROI
10. vCenter’s workload with virtual desktops is generally higher (compared to vCenter servers managing server workloads) due to increased frequency of things like power operations and provisioning operations from View Composer. One (1) vCenter should be used per Block, or up to 2000 users.
11. In the event of performance/stability issues in the View Block/s, if the management servers shared the cluster, the ability for vSphere/View administrators to access management servers will likely be impacted, which may delay the troubleshooting process and eventual resolution of the issue/s
12. Having a separate management cluster with dedicated storage (RAID packs/aggregates and/or storage controllers) prevents the IO load of the View Desktops impacting the ability to manage the environment, especially during recompose and provisioning operations.

Implications

1.  Hardware will be required for the Management cluster. However, as the ESXi hosts in the View Blocks will not be hosting management workloads, they should achieve higher consolidation ratios, which should come close to, if not entirely, neutralizing the cost of the Management host hardware
2. The storage solution will need to provide storage for Management virtual machines which is separate to Virtual desktops
3. The scale out approach for the management cluster may not achieve as high memory savings from transparent page sharing due to having fewer virtual machines per host
4. Having an additional cluster is an additional administrative overhead, albeit a minimal one; however, this should reduce the risk in the environment, leading to lower BAU effort/costs.

Alternatives

1. Run Management VMs in VMware View Blocks (with desktop workloads). – Not recommended
2. Run management VMs in an existing vSphere cluster running server workloads (if available)

Special thanks to Michael Webster (VCDX#66) for his contribution to this example Architectural Decision.

Example Architectural Decision – Securing vMotion & Fault Tolerance Traffic in IaaS/Cloud Environments

Problem Statement

vMotion and Fault Tolerance logging traffic is unencrypted, and anyone with access to the same VLAN/network could potentially view and/or compromise this traffic. How can the environment be made as secure as possible in a multi-tenant/multi-department environment?

Assumptions

1.  vMotion and FT is required in the vSphere cluster/s (although FT is currently not supported for VMs hosted with vCloud Director)
2. IP Storage is being used and vNetworking has 2 x 10Gb NICs for non Virtual Machine traffic such as VMkernel ports & 2 x 10Gb NICs available for Virtual Machine traffic (Similar to Example vNetworking Design for IP Storage)
3. VI3 or later

Motivation

1. Ensure maximum security and performance for vMotion and FT traffic
2. Prevent vMotion and/or FT traffic impacting production virtual machines

Architectural Decision

vMotion & Fault Tolerance logging traffic will each have a dedicated non routable VLAN, hosted on a dvSwitch which is physically separate from the virtual machine distributed virtual switch.

Justification

1.  vMotion / FT traffic does not require external (or public) access
2. A VLAN per function ensures maximum security / performance with minimal design / implementation overhead
3. Prevents vMotion and/or FT traffic potentially impacting production virtual machines (and vice versa), which could occur if the traffic shared one or more broadcast domains
4. Ensures vMotion/FT traffic cannot leave their respective dedicated VLANs and potentially be sniffed

Implications

1. Two (2) VLANs with private IP ranges are required to be presented over 802.1q connections to the appropriate pNICs
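
A proposed VLAN plan can be sanity checked against the decision above before it is handed to the network team. The sketch below uses only the Python standard library; the VLAN IDs, subnets and the `gateway` field are hypothetical placeholders, not values from this design.

```python
import ipaddress

# Hypothetical plan: IDs and subnets are placeholders for illustration only.
plan = {
    "vMotion":    {"vlan": 101, "subnet": "192.168.101.0/24", "gateway": None},
    "FT logging": {"vlan": 102, "subnet": "192.168.102.0/24", "gateway": None},
}

def validate(plan):
    """Return a list of problems; an empty list means the plan fits the decision."""
    problems = []
    vlans = [p["vlan"] for p in plan.values()]
    if len(set(vlans)) != len(vlans):
        problems.append("each function needs its own VLAN ID")
    nets = {}
    for name, p in plan.items():
        if not 1 <= p["vlan"] <= 4094:
            problems.append(f"{name}: VLAN ID outside the usable 802.1Q range")
        nets[name] = ipaddress.ip_network(p["subnet"])
        if not nets[name].is_private:
            problems.append(f"{name}: subnet is not a private range")
        if p["gateway"] is not None:
            problems.append(f"{name}: a gateway would make the VLAN routable")
    names = list(nets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} and {b}: subnets overlap")
    return problems

print(validate(plan) or "plan OK")
```

The check covers only what the decision states (one VLAN per function, private ranges, no routing); it does not verify the physical dvSwitch separation.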

Alternatives

1.  vMotion / FT share the ESXi management VLAN – This would increase risk of traffic being intercepted and “sniffed”
2. vMotion / FT share a dvSwitch with Virtual Machine networks while still running within dedicated non routable VLANs over 802.1q

Example Architectural Decision – Distributed Power Management (DPM) for Virtual Desktop Clusters

Problem Statement

In a VMware View (VDI) environment where the bulk of the workforce work between 8am and 6pm daily, how can vSphere be configured to minimize the power consumption without significant impact to the end user experience?

Assumptions

1. The bulk of the workforce work between 8am and 6pm daily
2. Most users login during a 2 hour window between 7:30am and 9:30am daily
3. Most users logoff during a 2 hour window between 4:30pm and 6:30pm daily
4. VMware View cluster maintains at least N+1 redundancy
5. VMware View cluster only runs desktop workloads
6. VMware View cluster size is >=5
7. VMware View cluster/s are configured with HA admission control policy of “Percentage of cluster resources reserved for HA” to avoid the potentially inefficient slot size calculation preventing hosts going into standby mode

Motivation

1. Reduce the power consumption
2. Align with Green IT strategies
3. Reduce the datacenter costs
4. Reduce the carbon footprint

Architectural Decision

Configure and enable DPM on all ESXi hosts with the power management set to “Automatic” and the DPM threshold set to “Apply priority 3 or higher recommendations”, and set hosts 1, 2 and 3 in the cluster not to enter standby mode.

Justification

1. As the bulk of the users are inactive outside of normal business hours, a significant power saving can be achieved
2. The users do not all login at once, which allows DPM to gradually start ESXi hosts (which were put into standby mode by DPM previously)
3. In the event the workload is unusually low on a given day, power savings can be realized without significant impact to the end user experience
4. Where a large number of users login unexpectedly early one morning, the impact to users will be minimal
5. DPM is configured to ensure a minimum of three (3) ESXi hosts remain on at all times. This number is expected to be able to support all desktops within the environment under low load (i.e. 80% of desktops at idle). This number can be adjusted if required.
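
Whether three always-on hosts are enough under low load can be estimated with a rough CPU demand check. Every figure below except the 80% idle ratio and the three hosts is a hypothetical placeholder (per desktop MHz demand, host capacity, utilization ceiling); real values should come from monitoring data, and memory should be checked the same way.

```python
def always_on_capacity_check(desktops, idle_pct, idle_mhz, active_mhz,
                             hosts_on, host_mhz, max_util_pct=80):
    """Compare estimated CPU demand with the capacity of the always-on hosts."""
    idle = desktops * idle_pct // 100
    active = desktops - idle
    demand_mhz = idle * idle_mhz + active * active_mhz
    # Only plan to use part of each host so there is headroom for bursts.
    usable_mhz = hosts_on * host_mhz * max_util_pct // 100
    return demand_mhz <= usable_mhz, demand_mhz, usable_mhz

# Assumed: 2000 desktops, 50 MHz idle / 300 MHz active demand per desktop,
# hosts with 40 cores at 2400 MHz. None of these come from the design itself.
ok, demand, usable = always_on_capacity_check(
    desktops=2000, idle_pct=80, idle_mhz=50, active_mhz=300,
    hosts_on=3, host_mhz=40 * 2400)
print(ok, demand, usable)  # True 200000 230400
```

If the check fails for the measured workload, the number of hosts excluded from standby should be raised, as noted in justification 5.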

Implications

1. In the unlikely event a large number of users logon unexpectedly early one morning, the impact to users may be experienced for the time it takes for one or more ESXi hosts to exit standby mode. This is generally <10 mins for most servers.
2. Out of band interfaces such as DRAC / iLO / RSA or IMM interfaces (depending on host hardware type) will need to be configured and be accessible to vCenter and the ESXi hosts to enable DPM to function
3. As the “Percentage of cluster resources reserved for HA” setting is static (not dynamically adjusted by DPM), in the event of a host failure while one or more hosts are in standby mode, and in the unlikely event a VM attempts to power on before a host has been able to successfully exit standby mode, the VM may fail to power on.
4. Where large percentages of Memory reservations are used (see Example AD – Memory Reservation for VDI), the ability for DPM to put one or more hosts into standby will be reduced. Where DPM is expected to be used, no more than 50% memory reservation should be configured, so that memory overcommitment can still be achieved without placing a significant overhead on the shared storage for vSwap files
5. Monitoring solutions may need to be customized/modified not to trigger an alarm for a host that is put into standby mode
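
The trade-off in implication 4 can be quantified: each VM's vSwap file is sized at its configured memory minus its memory reservation, so a lower reservation means more shared storage consumed by vSwap. A minimal sketch using the 2000 desktop, 1GB per desktop figures from the first example decision (an assumption here, since this decision does not state a desktop size):

```python
def vswap_gb(desktops, vram_gb, reservation_pct):
    """Total vSwap space: configured memory minus reservation, per desktop."""
    return desktops * vram_gb * (100 - reservation_pct) / 100

for pct in (0, 50, 100):
    print(f"{pct}% reservation: {vswap_gb(2000, 1, pct):.0f} GB of vSwap")
# 0% reservation: 2000 GB of vSwap
# 50% reservation: 1000 GB of vSwap
# 100% reservation: 0 GB of vSwap
```

A 50% reservation halves the vSwap footprint while still leaving half of each desktop's memory available for overcommitment and DPM consolidation.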

Alternatives

1. Set a lower number of hosts to remain on to maximize power savings – This may result in higher impact to users first thing in the morning in the event of high concurrent logins
2. Set a higher number of hosts to remain on, however this will minimize power savings and give less value to the added complexity of setting up DPM (and associated out of band management interfaces)
3. Set the DPM threshold to be more aggressive to maximize power savings – This would likely result in some impact to VMs due to fewer physical cores being available to the CPU scheduler and less physical memory being available for VMs, which may result in swapping