Example Architectural Decision – Site Recovery Manager Deployment Location

Problem Statement

To ensure the Production vSphere environment(s) can meet/exceed the required RTOs in the event of a declared site failure, and to allow scheduled DR testing to be performed easily, VMware Site Recovery Manager will be used to automate failover to the secondary site.

What is the most suitable way to deploy Site Recovery Manager to ensure the environment can be maintained with minimal risk/complexity?

Requirements

1. Meet/Exceed RTO requirements
2. Ensure solution is fully supported

Assumptions

1. vCenter is considered a Tier 1 application
2. vSphere 5.1
3. SRM 5.1
4. A single Windows instance hosts vCenter, SSO and Inventory services and is protected by vCenter Heartbeat

Constraints

1. SRM is not protected by vCenter Heartbeat

Motivation

1. Reduce the complexity for BAU maintenance

Architectural Decision

Install Site Recovery Manager on a dedicated Windows 2008 R2 instance

Justification

1. Installing, upgrading or patching SRM, including its Storage Replication Adapters (SRAs), may require a reboot or troubleshooting, which could impact the production vCenter, including the SSO and Inventory services.

2. Having SRM separate from vCenter ensures failover is not unnecessarily delayed in the event of a disaster by resource contention with vCenter on the same VM

3. SRM and vCenter work (and peak) at the same time in the event of an outage; as such they are not complementary workloads suited to sharing a VM

4. If hosted on the vCenter server, SRM would be subject to the same change windows and would be impacted by any maintenance performed on other applications running on the same OS instance

5. The SRM application has different availability requirements than vCenter. If SRM were combined with vCenter, SRM (having the lower availability requirement) would have to be treated with the same change management and care as vCenter, which would complicate BAU maintenance

6. The SRM service has different business maintenance requirements to vCenter; as such the two are not suited to being placed on the same VM

7. Having SRM on a dedicated VM aligns with the scaling out recommendation for virtual workloads

8. Having additional components on the same OS increases complexity and may reduce the availability of vCenter

Alternatives

1. Place SRM on the vCenter server

Implications

1. One (1) additional Windows 2008 R2 license will be required

2. One (1) additional Windows instance will need to be maintained in BAU

I would like to thank James Wirth, VCDX #83 (@jimmywally81), for his contribution to this example architectural decision.

Related Articles

1. VMware Site Recovery Manager, Physical or Virtual machine?

2. Swap file location for SRM protected VMs


Example Architectural Decision – vSphere 5.1 Single Sign On (SSO) deployment mode across Active/Active Datacenters

Problem Statement

What is the most suitable deployment mode for vCenter Single-Sign On (SSO) in an environment where there are two (2) physical datacenters running in an Active/Active configuration?

Requirements

1. The solution must be a fully supported configuration
2. Meet/Exceed RTO of 4 hours
3. Environment must support SRM failover between Datacenter A and Datacenter B where an entire datacenter is lost

Assumptions

1. Three (3) vCenter servers will be used: One (1) at Datacenter A and Two (2) at Datacenter B
2. Environment has Two (2) Production clusters (One per Datacenter), and One (1) vCloud Cluster at Datacenter B each with a dedicated vCenter
3. Stretched clusters are not used
4. All vSphere Infrastructure servers (including SSO) are protected by SRM and vSphere HA
5. The inter-site Metropolitan Area Network (MAN) is high bandwidth (>10Gb), low latency (<5ms) and highly available (99.999%)
6. The average number of authentications per second for each SSO instance is <30 (Configuration Maximum)

Constraints

1. The environment uses a traditional agent-based backup solution which may not meet RPO/RTO requirements

Motivation

1. Future proof the environment

Architectural Decision

1. Use “Multi-site” SSO deployment mode
2. Do not use SSO “High Availability” clusters
3. The Primary SSO server will be at Datacenter B
4. The remaining vCenter servers will be “Secondaries” and will point to the Datacenter B Primary SSO instance (the resulting topology is illustrated in the sketch following this list)
5. Each SSO instance will run on a dedicated Windows 2008 R2 x64 instance
6. Each SSO instance will use the bundled SQL database
7. (Optional) For greater availability, vCenter Heartbeat will be used to protect each SSO instance
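
The following is a minimal sketch (plain Python, not VMware tooling) that models the decision above to make the traffic flows concrete. The vCenter names and the site-to-SSO mapping are assumptions made up for this example; in Multisite mode each vCenter authenticates against its local SSO instance, and only the Secondary at Datacenter A replicates with the Primary over the MAN.

```python
# Hypothetical model of the chosen Multisite SSO topology.
# vCenter names below are invented for illustration only.

SSO_PRIMARY_SITE = "Datacenter B"

# Each vCenter authenticates against the SSO instance at its own site.
vcenters = {
    "vCenter-Prod-A":   {"site": "Datacenter A", "sso_site": "Datacenter A"},
    "vCenter-Prod-B":   {"site": "Datacenter B", "sso_site": "Datacenter B"},
    "vCenter-vCloud-B": {"site": "Datacenter B", "sso_site": "Datacenter B"},
}

# Authentication traffic that would cross the MAN: none, because every
# vCenter uses an SSO instance located at its own datacenter.
auth_over_man = [name for name, vc in vcenters.items()
                 if vc["site"] != vc["sso_site"]]

# Only Secondary SSO instances outside the Primary's site replicate over the MAN.
replication_over_man = sorted({vc["sso_site"] for vc in vcenters.values()
                               if vc["sso_site"] != SSO_PRIMARY_SITE})

print("Auth flows over MAN:", auth_over_man)                    # []
print("Sites replicating SSO over MAN:", replication_over_man)  # ['Datacenter A']
```

This mirrors justification 8 below: with the Primary at Datacenter B, only Datacenter A's SSO traffic traverses the MAN link.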

Justification

1. The environment is being designed, where possible, to sustain a MAN failure between the two (2) datacenters

2. If “High Availability” mode is used, at least one (1) vCenter would be accessing SSO across the MAN link which introduces an unnecessary dependency on the MAN links

3. “High Availability” currently requires manual intervention which can be complicated and problematic

4. “Basic” mode prevents the use of Linked Mode which will make management of the environment more difficult

5. Using Multisite mode allows faster access to authentication services as each SSO instance is configured with Active Directory servers located at the same datacenter.

6. Multisite mode is required for the use of Linked Mode, and Linked Mode will make day-to-day management easier

7. If one SSO instance goes offline for any reason, production virtual machines will not be impacted; it will simply prevent authentication to the affected vCenter server.

8. Having the SSO Primary at Datacenter B ensures only traffic from one vCenter (the Datacenter A vCenter) traverses the MAN link, as the third vCenter (for vCloud Director) is at Datacenter B

9. In the event of a full datacenter-wide failure at Datacenter B for any reason, the Primary SSO instance being offline will not impact the management of Datacenter A or the ability to recover the environment with SRM.

10. During an SSO upgrade, vCenter servers of different versions cannot co-exist on the same SSO instance, so using a centralized (or shared) SSO instance would overly complicate the upgrade process and lead to extended impact on the vSphere environments.

Alternatives

1. Use “Basic” Mode, resulting in a standalone version of SSO for each vCenter server

2. Use a “High Availability Cluster” (sharing the same SSO database and identity sources) with one SSO server per physical datacenter

3. Use “Multisite” deployment with “High Availability Clusters” per datacenter

4. Host SSO database on a SQL Server

5. Run SSO on the vCenter server with or without the SSO database locally

6. Run a single SSO instance shared by all three (3) vCenters and use vCenter Heartbeat running across the MAN to protect SSO

Implications

1. Without a “High Availability Cluster”, or SSO being protected by vCenter Heartbeat at each datacenter, the SSO instance at each site is a single point of failure; if it fails, authentication to the affected vCenter will fail

2. In the event of an SSO server failing at Datacenter A, the SSO role does not fail over to Datacenter B, or vice versa. In this case, all authentication requests at the site where SSO has failed will fail.

3. Requires the installable version of SSO, which is Windows-only; the vCenter Server Appliance (VCSA) cannot be used.

4. Additional Windows 2008 licenses are required for the SSO servers

Related Articles

1. Disabling Single Sign On – Don't Do It! – LongWhiteClouds

2. vSphere 5.1 Single Sign On (SSO) Configuration – Architectural Decision flowchart

I would like to thank Michael Webster, VCDX #66 (@vcdxnz001), for his contribution to this example architectural decision.


Example Architectural Decision – Number of paths per LUN for VMFS datastores

Problem Statement

In a vSphere environment hosting a large number of VMs, virtual machine I/O requirements range from small (<100 IOPS) to large business-critical applications with tens of thousands of IOPS. The ESXi hosts have been configured with 4 x 8Gb FC connections across two (2) dual-port HBAs.

What is the most suitable number of paths per LUN when using 4 x 8Gb FC connections per host, and how will they be presented in a highly available manner with two (2) SAN Fabrics connected to an Active/Active Enterprise Disk array?

Requirements

1. All LUNs are available on all FC Interfaces
2. The storage must be highly available
3. The environment should be able to continue running production workloads in the unlikely event of a dual-port HBA or single Fabric failure
4. The environment must maintain a consistent level of performance

Assumptions

1. The Storage area network has two (2) fabrics each of which is highly available
2. The disk system is presented to both SAN fabrics
3. The number of VMs per host is >100
4. vSphere 4.0 or later
5. Storage array is Active/Active
6. ESXi hosts are large and are designed to drive significant I/O
7. VAAI is supported and enabled

Constraints

1. Maximum paths supported per ESXi host is 1024
2. Maximum number of datastores per ESXi host is 256

Motivation

1. Ensure optimal performance and redundancy
2. Maximize the total capacity able to be presented to a cluster

Architectural Decision

Use a standard of 8 paths per LUN

Each LUN will be presented to each HBA via both Controller A and Controller B resulting in two paths per LUN per HBA.

With a total of 4 FC connections across two (2) physical dual-port HBAs in a HA configuration, with one (1) connection per HBA per Fabric, this equates to a total of 8 paths per LUN to the ESXi host (4 paths per Fabric), as the sketch below illustrates.
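
To make the arithmetic behind this decision explicit, here is a short back-of-the-envelope check in Python. The port, fabric and controller counts come straight from the design above; the 1024-path and 256-datastore figures are the stated constraints. This is an illustrative sketch, not VMware tooling.

```python
# Path math for the chosen design (values taken from the decision above).

hba_ports = 4    # two (2) dual-port HBAs
fabrics = 2      # one port per HBA connected to each Fabric
controllers = 2  # Active/Active array: each LUN presented via Controller A and B

paths_per_lun = hba_ports * controllers         # 4 x 2 = 8 paths per LUN
paths_per_fabric = paths_per_lun // fabrics     # 4 paths per Fabric

MAX_PATHS_PER_HOST = 1024       # constraint 1
MAX_DATASTORES_PER_HOST = 256   # constraint 2

max_luns = MAX_PATHS_PER_HOST // paths_per_lun  # 1024 / 8 = 128 LUNs
assert max_luns <= MAX_DATASTORES_PER_HOST      # 128 fits within the datastore limit

# Failure of one Fabric (or one dual-port HBA) halves the ports but keeps
# both controllers reachable: 2 ports x 2 controllers = 4 surviving paths.
surviving_paths = (hba_ports // fabrics) * controllers

print(paths_per_lun, paths_per_fabric, max_luns, surviving_paths)  # 8 4 128 4
```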

Justification

1. This equates to 4 paths (1 per HBA interface per LUN) per Fabric
2. VMware NMP with the “Round Robin” path selection policy will be used; presenting all LUNs via both Fabrics and all HBA ports provides the maximum reduction in latency and the most consistent performance overall
3. 8 paths per LUN ensures up to 128 LUNs can be presented within the 1024 paths per ESXi host limit which will support sufficient capacity for the cluster
4. The solution is highly available as it uses two fabrics and both controllers are Active
5. In the event of a Fabric failure, the remaining Fabric serving 2 x 8Gb connections will provide connectivity to both Controller A and B, with a total of 4 paths
6. Ensures the cluster can have enough LUNs to balance workloads across, which will assist in keeping latency to a minimum

Alternatives

1. Have fewer paths per LUN, which enables the use of more LUNs
2. Have more paths per LUN and fewer LUNs

Implications

1. LUNs will need to be sized to ensure a maximum of 128 LUNs provides sufficient capacity to cater for the desired number of virtual machines (a worked sizing check is sketched below)
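
The following hypothetical sizing check shows how this implication could be validated. The VM count, average VM size and candidate LUN size are illustrative assumptions, not figures from the design; only the 128-LUN ceiling is derived from the decision above.

```python
# Hypothetical capacity check against the 128-LUN ceiling (8 paths per LUN
# within the 1024-path host limit). All workload figures are assumptions.

MAX_LUNS = 128
lun_size_gb = 2048        # candidate 2TB LUN size
vm_count = 1000           # assumed number of VMs across the cluster
avg_vm_size_gb = 150      # assumed average footprint per VM, incl. swap/overheads

required_gb = vm_count * avg_vm_size_gb   # 150,000 GB required
available_gb = MAX_LUNS * lun_size_gb     # 262,144 GB available

print(f"required={required_gb} GB, available={available_gb} GB")
assert available_gb >= required_gb, "Use larger LUNs to stay within the 128-LUN cap"
```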
