Example Architectural Decision – vSphere 5.1 Single Sign On (SSO) deployment mode across Active/Active Datacenters

Problem Statement

What is the most suitable deployment mode for vCenter Single-Sign On (SSO) in an environment where there are two (2) physical datacenters running in an Active/Active configuration?

Requirements

1. The solution must be a fully supported configuration
2. Meet/Exceed RTO of 4 hours
3. Environment must support SRM failover between Datacenter A and Datacenter B where an entire datacenter is lost

Assumptions

1.Three (3) vCenter servers will be used, One (1) at Datacenter A and Two (2) at Datacenter B
2. Environment has Two (2) Production clusters (One per Datacenter), and One (1) vCloud Cluster at Datacenter B each with a dedicated vCenter
3. Stretched clusters are not used
4. All vSphere Infrastructure servers (including SSO) are protected by SRM and vSphere HA
5. Inter-site Metropolitan Area Network is high bandwidth (>10Gb) , low latency (<5ms) and highly available (99.999%)
6. The average number of authentications per second for each SSO instance is <30 (Configuration Maximum)

Constraints

1. The environment uses traditional agent based backup solution which may not meet RPO/RTO requirements

Motivation

1. Future proof the environment

Architectural Decision

1. Use “Multi-site” SSO deployment mode
2. Do not use SSO “High Availability” clusters
3. The Primary SSO server will be at Datacenter B
4. The remaining vCenter servers will be “Secondaries” and point to the Datacenter B Primary SSO instance
5. The each SSO instance will be on a dedicated Windows 2008 x64 R2 instance
6. Each SSO instance will use the bundled SQL database
7. (Optional) For greater availability , vCenter Heartbeat will be used to protect each SSO instance

Justification

1. The environment is being designed (where) possible to sustain a Metropolitan Area Network failure between the two (2) datacenters

2. If “High Availability” mode is used, at least one (1) vCenter would be accessing SSO across the MAN link which introduces an unnecessary dependency on the MAN links

3. “High Availability” currently requires manual intervention which can be complicated and problematic

4. “Basic” mode prevents the use of Linked Mode which will make management of the environment more difficult

5. Using Multisite mode allows faster access to authentication services as each SSO instance is configured with Active Directory servers located at the same datacenter.

6. Multisite mode is required for the use of Linked-Mode and Linked Mode will  make day to day management easier

7. If one instance SSO goes offline for any reason, this will not impact production virtual machines. It will simply prevent any authentication to the affected vCenter server.

8. Having the SSO Primary at Datacenter B ensures only traffic from one vCenter (Datacenter A vCenter) traverses the MAN link as the third vCenter (for vCloud Director) is at Datacenter B

9. In the event of Datacenter B having a full datacenter wide failure for any reason, the Primary SSO instance being offline will not impact the management of Datacenter A OR the ability for the environment being recovered by SRM.

10. During an SSO upgrade, multiple vCenter’s cannot co-exist and using a centralized (or shared) SSO instance would overly complicate the upgrade process and lead to extended impact to the vSphere environments.

Alternatives

1. Use “Basic” Mode, resulting in a standalone version of SSO for each vCenter server

2. Use “High Availability Cluster” (Shared the same SSO database and identity sources) with one SSO server per physical datacenter

3. Use “Multisite” deployment with “High Availability Clusters” per datacenter

4. Host SSO database on a SQL Server

5. Run SSO on the vCenter server with or without the SSO database locally

6. Run a single SSO instance shared by all three (3) vCenters and use vCenter Heartbeat running across the MAN to protect SSO

Implications

1. Without a “High Availability Cluster” or SSO being protected by vCenter Heartbeat at each datacenter, the SSO for each site is a Single point of failure where authentication to the affected vCenter will fail

2. In the event of one (1) SSO server failing at Datacenter A, the SSO role does not failover to Datacenter B, or vice versa. In this case, All authentication requests on the site where SSO has failed, will fail.

3. Requires the installable version of SSO, which is Windows Only. The use of the vCenter Server Appliance (VCSA) is not available.

4. Additional Windows 2008 licenses are required for the SSO servers

Related Articles

1. Disabling Single Sign On – Dont Do It! – LongWhiteClouds

2. vSphere 5.1 Single Sign On (SSO) Configuration – Architectural Decision flowchart

I would like to Thank Michael Webster VCDX#66 (@vcdxnz001) for his contribution to this example architectural decision.

CloudXClogo

 

 

Example Architectural Decision – Host Isolation Response for FC Based storage

Problem Statement

What are the most suitable HA / host isolation settings where the environment uses Storage (IBM SVC) with FC connectivity via a dedicated highly available Storage Area Network (SAN) fabric where ESXi Management and Virtual Machine traffic run over a highly available data network?

Requirements

1. Ensure in the event of one or more hosts becoming isolated, the environment responds in an automated manner to recover VMs where possible

Assumptions

1.The Network is highly available (>99.999% availability)
2. The Storage is highly available (>99.999% availability)
3. vSphere 5.0 or later
4. ESXi hosts are connected to the network via two physical separate switches via two physical NICs

Constraints

1. FC (Block) based storage

Motivation

1. Meet/Exceed availability requirements
2. Minimize the chance of a false positive isolation event

Architectural Decision

Turn off the default isolation address by setting the below advanced setting

“das.usedefaultisolationaddress” = False

Configure three (3) isolation addresses by setting the below advanced settings

“das.isolationaddress1″ = 192.168.1.1 (Core Router)

“das.isolationaddress2″ = 192.168.1.2 (Core Switch 1 )

“das.isolationaddress3″ = 192.168.1.3 (Core Switch 2 )

Configure Datastore Heartbeating with “Select any of the clusters datastores”

Configure Host Isolation Response to: “Shutdown”

Justification

1. When using FC storage, it is possible for the Management and Virtual Machine Networks to be unavailable, while the Storage network is working perfectly. In this case Virtual machines may not be able to communicate to other servers, but can continuing reading/writing from disk. In this case, they will likely not be servicing customer workloads, as such, Shutting the VM down gracefully allows HA to restart the VM/s on host/s which are not isolated gives the VM a greater chance of being able to resume servicing workloads than remaining on an isolated host.
2. Datastore heartbeating will allow HA to confirm if the host is “isolated” or “failed”. In either case, Shutting down the VM will allow HA to recover the VM on a surviving host.
3. As all storage is presented via Active/Active IBM SVC controllers, there is no benefit is specifying specific datastores to be used for heartbeating
4. The selected isolation addresses were chosen as they are both highly available devices in the network which are essential for network communication and cover the core routing and switching components in the network.
5. In an environment where the Network is highly available an isolation event is extremely unlikely  as such, where the three (3) isolation addresses cannot be contacted, it is unlikely the network can be restored in a timely manner OR the host has suffered multiple concurrent failures (eg: Multiple Network Cards) and performing a controlled shutdown helps ensure when the network is recovered, the VMs are brought back up in a consistent state, OR in the event the isolation impacts only a subset of ESXi hosts in the cluster, the VM/s can be recovered by HA and resume normal operations.

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address

Implications

1. In the event the host cannot reach any of the isolation addresses, virtual machines will be Shutdown
2.  Using “Shutdown” as opposed to “Power off” ensures a graceful shutdown of the guest operating system, however this will delay the HA restart of the VM for up to 5 mins (300 seconds) if VMware Tools is unable to do a controlled shutdown, in which case after 300 seconds a “Power Off” will be executed.
3. In the unlikely event of network instability, VMs may be Shutdown prematurely.

CloudXClogo

 

 

Example Architectural Decision – VMware HA – Percentage of Cluster resources reserved for HA

Problem Statement

The decision has been made to use “Percentage of cluster resources reserved for HA” admission control setting, and use Strict admission control to ensure the N+1 minimum redundancy level is maintained. However, as most virtual machines do not use  “Reservations” for CPU and/or Memory, the default reservation is only 32Mhz and 0MB+overhead for a virtual machine. In the event of a failure, this level of resources is unlikely to provide sufficient compute to operate production workloads. How can the environment be configured to ensure a minimum level of performance is guaranteed in the event of one or more host failures?

Requirements

1. All Clusters have a minimum requirement of N+1 redundancy
2. In the event of a host failure, a minimum level of performance must be guaranteed

Assumptions

1. vSphere 5.0 or later (Note: This is Significant as default reservation dropped from 256Mhz to 32Mhz, RAM remained at 0MB + overhead)

2. Percentage of Cluster resources reserved for HA is used and set to a value as per Example Architectural Decision – High Availability Admission Control

3. Strict admission control is enabled

4. Target over commitment Ratios are <=4:1 vCPU / Physical Cores | <=1.5 : 1 vRAM / Physical RAM

5. Physical CPU Core speed is >=2.0Ghz

6. Virtual machines sizes in the cluster will vary

7. A limited number of mission critical virtual machines may be set with reservations

8. Average VM size uses >2GB RAM

9. Clusters compute resources will be utilized at >=50%

Constraints

1. Ensuring all compute requirements are provided to Virtual machines during BAU

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity
3. Ensure the target availability and performance is maintained without significantly compromising  over commitment ratios

Architectural Decision

Ensure all clusters remain configured with the HA admission control setting use
“Enable – Do not power on virtual machines that violate availability constraints”

and

Use “Percentage of Cluster resources reserved for HA” for the admission control policy with the percentage value based on the following Architectural Decision – High Availability Admission Control

Configure the following HA Advanced Settings

1. “das.vmMemoryMinMB” with a value of “1024″
2. “das.vmCpuMinMHz” with a value of “512”

Justification

1. Enabling admission control is critical to ensure the required level of availability.
2. The “Percentage of cluster resources reserved for HA” setting allows a suitable percentage value of cluster resources to reserved depending on the size of each cluster to maintain N+1
3.The potentially inefficient slot size calculation used with “Host Failures cluster tolerates” does not suit clusters where virtual machines sizes vary and/or where some mission Critical VMs require reservations

  • 4.
  • Using advanced settings “das.vmCpuMinMHz” & “das.vmMemoryMinMB” allows a minimum level of performance (per VM) to be guaranteed in the event of one or more host failures
  • 5.
  • Advanced settings have been configured to ensure the target over commit ratios are still achieved while ensuring a minimum level of resources in a the event of a host failure
  • 6.
  • Maintains an acceptable minimum level of performance in the event of a host failure without requiring the administrative overhead of setting and maintaining “reservations” at the Virtual machine level
  • 7.
  • Where no reservations are used, and advanced settings not configured, the default reservation would be 32Mhz and 0MB+ memory overhead is used. This would likely result in degraded performance in the event a host failure occurs.

Alternatives

1. Use “Specify a fail over host” and have one or more hosts specified
2. “Host Failures cluster tolerates” and set it to appropriate value depending on hosts per cluster without using advanced settings
3.Use higher Percentage values
4. Use Higher / Lower values for “das.vmMemoryMinMB” and “das.vmCpuMinMHz”
5. Set Virtual machine level reservations on all VMs

Implications

1. The “das.vmCpuMinMHz” advanced setting applies on a per VM basis, not a per vCPU basis, so VMs with multiple vCPUs will still only be guarenteed 512Mhz in a HA event

2. This will reduce the number of virtual machines that can be powered on within the cluster (in order to enforce the HA requirements)

CloudXClogo