Example Architectural Decision – BC/DR Solution for vCloud Director

Problem Statement

What is the most suitable BC/DR solution for a vCloud Director environment?

Requirements

1. Ensure the vCloud solution can tolerate a site failure in an automated manner
2. Ensure the vCloud solution meets/exceeds the RTO of 4hrs
3. Comply with all requirements of the Business Continuity Plan (BCP)
4. Solution must be a supported vSphere / vCloud Configuration
5. Ensure all features / functionality of the vCloud solution are available following a DR event

Assumptions

1. Datacenters are in an Active/Active configuration
2. Stretched Layer 2 network across both datacenters
3. Storage based replication between sites
4. vSphere 5.0 Enterprise Plus or later
5. VMware Site Recovery Manager 5.0 or later
6. vCloud Director 1.5 or later
7. There is no requirement for workloads proposed to be hosted in vCloud to be at one datacenter or another

Constraints

1. The hardware for the solution has already been chosen and purchased: six (6) 4-way, 32-core hosts with 512GB RAM and 4 x 10Gb NICs each
2. The storage solution is already in place and does not support a Metro Storage Cluster (vMSC) configuration

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity

Architectural Decision

Use the vCloud DR solution as described in the “vCloud Director Infrastructure Resiliency Case Study” (by Duncan Epping @duncanyb and Chris Colotti @Ccolotti).

In summary, host the vSphere/vCloud management virtual machines on an SRM-protected cluster.

Use a dedicated cluster for vCloud compute resources.

Configure the vSphere cluster dedicated to providing compute resources to the vCloud environment (Provider Virtual Data Center – PvDC) with four (4) compute nodes located at Datacenter A for production use and two (2) compute nodes located at Datacenter B (in “Maintenance Mode”) dedicated to DR.

Storage will not be stretched across sites; LUNs will be presented locally from “Datacenter A” shared storage to the “Datacenter A” based hosts. The “Datacenter A” storage will be replicated synchronously to “Datacenter B” and presented from “Datacenter B” shared storage to the two (2) “Datacenter B” based hosts.

In the event of a failure, SRM recovers the vSphere/vCloud management virtual machines, bringing the cloud back online. A script run as the last step of the SRM recovery plan then mounts the replicated storage to the ESXi hosts in “Datacenter B” and takes the two (2) hosts at “Datacenter B” out of Maintenance Mode. HA then detects the virtual machines and powers them on.
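To give a rough idea of what that final scripted step could look like, below is a minimal pyVmomi sketch (using hypothetical host names and credentials; this is not the script from the referenced case study). It rescans storage on the standby hosts so the replicated LUNs become visible, then takes the hosts out of Maintenance Mode so HA can restart the vCloud workloads.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

DR_HOSTS = ["esx-drb-01.lab.local", "esx-drb-02.lab.local"]  # hypothetical names

# Connect to the "Datacenter B" vCenter (hypothetical address/credentials).
si = SmartConnect(host="vcenter-b.lab.local", user="administrator@lab.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
hosts = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True).view

for host in hosts:
    if host.name not in DR_HOSTS:
        continue
    # Rescan so the newly presented replicated LUNs and their VMFS volumes
    # become visible to the standby hosts.
    host.configManager.storageSystem.RescanAllHba()
    host.configManager.storageSystem.RescanVmfs()
    # Take the host out of Maintenance Mode so vSphere HA can restart the
    # vCloud workloads on it.
    if host.runtime.inMaintenanceMode:
        WaitForTask(host.ExitMaintenanceMode_Task(timeout=0))

Disconnect(si)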

Justification

1. Stretched Clusters are more suited to Disaster Avoidance than Disaster Recovery
2. Avoids the complex and manual intervention required during a disaster with a stretched cluster solution
3. A stretched cluster provides minimal control in the event of a disaster, whereas in this solution HA simply restarts VMs once the storage is presented (automatically) and the hosts are taken out of Maintenance Mode (also automated)
4. Having two (2) ESXi hosts for the vCloud resource cluster set up in “Datacenter B” in “Maintenance Mode”, with the storage mirrored as discussed, allows the virtual workloads to be recovered in an automated fashion as part of the VMware Site Recovery Manager solution
5. Removes the management overhead of running a stretched cluster, such as using DRS affinity rules to keep VMs on hosts at the same site as the storage
6. vSphere 5.1-backed resource clusters support clusters of more than eight (8) hosts with “Fast Provisioning”
7. Removes the dependency on the metropolitan area data and storage networks during BAU, along with the potential impact of inter-site latency on production workloads
8. Eliminates the chance of a “Split Brain” or “Datacenter Partition” scenario, where VMs could be running at both sites without connectivity to each other
9. There is no specific requirement for non-disruptive mobility between sites
10. Latency between sites cannot be guaranteed to be <10ms end to end

Alternatives

1. Stretched Cluster between “Datacenter A” and “Datacenter B”
2. Two independent vCloud deployments with no automated DR
3. Have more or fewer hosts at the DR site in the same configuration

Implications

1. Two (2) ESXi hosts in the vCloud Cluster located in “Datacenter B” will remain unused as “Hot Standby” unless there is a declared site failure at “Datacenter A”
2. Requires two (2) vCenter servers, one (1) per Datacenter
3. There will be no non-disruptive mobility between sites (i.e. vMotion)
4. SRM protection groups / recovery plans need to be created and managed. Note: This will be done as part of the production cluster
5. In the event of a DR event, only half the compute resources will be available compared to production.
6. Depending on the latency between sites, storage performance may be reduced by the synchronous replication, as a write will not be acknowledged to the VM at “Datacenter A” until it has been committed to the storage at “Datacenter B”

Nested ESXi 5.1 on IBM x3850 M2 (7141-xxx model)

For my test lab I use an IBM x3850 M2 model 7141-3RM. It’s a 4-socket server with 4 cores per socket and 96GB RAM, running nested ESXi hosts.

Until recently, I was still running ESXi 5.0 on the physical host with nested ESXi 5.1 hosts, which worked perfectly.

Recently I decided to upgrade the physical host to ESXi 5.1.

Problem: My nested ESXi 5.1 hosts can no longer run 64bit VMs.

Now why is that? The Intel E7330 supports Intel VT-x with Extended Page Tables (EPT) according to Intel’s website (see here), or so they say?!

[Images: Intel specification screenshots – “Essentials” and “Advanced Technologies”]

As mentioned by William Lam (@lamw) on his blog (VirtuallyGhetto):

A quick way to verify whether your CPU truly supports Intel VT-x + EPT or AMD-V + RVI is to paste the following into a browser:

https://[your-esxi-host-ip-address]/mob/?moid=ha-host&doPath=capability
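For reference, the same capability flag can also be read programmatically. The following is a minimal pyVmomi sketch, assuming a hypothetical host address and credentials.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="esxi01.lab.local", user="root", pwd="password",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
for host in content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view:
    # nestedHVSupported is only True when ESXi can expose Intel VT-x + EPT
    # (or AMD-V + RVI) to nested hypervisors for running 64-bit guests.
    print(host.name, "nestedHVSupported =", host.capability.nestedHVSupported)
Disconnect(si)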

So, after running the MOB check on my host, I got the following result:

[Image: host capability output showing nestedHVSupported = false]

So that’s a bummer: I can’t run ESXi 5.1 on my physical host with nested ESXi hosts running 64-bit VMs.

Note: 32-bit VMs run fine.

So, Intel are telling fibs about what this processor can do, but that doesn’t help solve the issue.

Solution: If you have this processor in your test lab and/or the IBM x3850 M2 model 7141-xxx (or you obtain one, as they are a great low-cost option for a test lab), run ESXi 5.0 on the physical host and ESXi 5.1 nested, and you will be able to run 64-bit VMs.

Example VMware vNetworking Design for IP Storage

On a regular basis, I am asked how to configure vNetworking to support environments using IP Storage (NFS / iSCSI).

The short answer is, as always, it depends on your requirements, but the below is an example of a solution I designed in the past.

Requirements

1. Provide high performance and redundant access to the IP Storage (in this case it was NFS)
2. Ensure ESXi hosts could be evacuated in a timely manner for maintenance
3. Prevent significant impact to storage performance by vMotion / Fault Tolerance and virtual machine traffic
4. Ensure high availability for ESXi Management / VMKernel and Virtual Machine network traffic

Constraints

1. Four (4) x 10Gb NICs
2. Six (6) x 1Gb NICs (two onboard NICs and a quad-port NIC)

Note: In my opinion the above NICs are hardly “constraining”, but they are still important to mention.

Solution

Use a standard vSwitch (vSwitch0) for the ESXi Management VMKernel. Configure vmNIC0 (onboard NIC 1) and vmNIC2 (quad-port NIC, port 1) as its uplinks.

ESXi Management will be active on both vmNIC0 and vmNIC2, although it will only use one path at any given time.
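As a rough illustration only, the following pyVmomi sketch (hypothetical host address and credentials, with the vmnic numbering described above) shows one way such a vSwitch0 uplink and teaming configuration could be applied.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Connect directly to the ESXi host (hypothetical address/credentials).
si = SmartConnect(host="esxi01.lab.local", user="root", pwd="password",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
host = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True).view[0]

spec = vim.host.VirtualSwitch.Specification(
    numPorts=128,
    # Bond onboard NIC 1 (vmnic0) and quad-port NIC port 1 (vmnic2).
    bridge=vim.host.VirtualSwitch.BondBridge(nicDevice=["vmnic0", "vmnic2"]),
    policy=vim.host.NetworkPolicy(
        nicTeaming=vim.host.NetworkPolicy.NicTeamingPolicy(
            policy="loadbalance_srcid",  # route based on originating port ID
            # Both uplinks active; Management still uses one path at a time.
            nicOrder=vim.host.NetworkPolicy.NicOrderPolicy(
                activeNic=["vmnic0", "vmnic2"]))))

host.configManager.networkSystem.UpdateVirtualSwitch(
    vswitchName="vSwitch0", spec=spec)
Disconnect(si)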

Use a Distributed Virtual Switch (dvSwitch-admin) for IP Storage, vMotion and Fault Tolerance.

Configure vmNIC6 (10Gb Virtual Fabric Adapter NIC 1 Port 1) and vmNIC9 (10Gb Virtual Fabric Adapter NIC 2 Port 2)

Configure Network I/O Control with NFS traffic having a share value of 100, while vMotion and FT each have a share value of 25.
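As a rough worked example, in the worst case where NFS, vMotion and FT traffic all contend on the same 10Gb uplink, these share values entitle NFS to 100 / (100 + 25 + 25) ≈ 66% of that link’s bandwidth, with vMotion and FT each entitled to roughly 17%.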

Each VMKernel for NFS will be active on one NIC and standby on the other.

vMotion will be active on vmNIC6 and standby on vmNIC9, and Fault Tolerance vice versa.

[Diagram: vNetworking example – dvSwitch-Admin]

Use a Distributed Virtual Switch (dvSwitch-data) for Virtual Machine traffic

Configure vmNIC7 (10Gb Virtual Fabric Adapter NIC 1 Port 2) and vmNIC8 (10Gb Virtual Fabric Adapter NIC 2 Port 1)

Conclusion

While there are many ways to configure vNetworking, and there may be more efficient ways to achieve the requirements set out in this example, I believe the above configuration achieves all the customer requirements.

For example, it provides high performance and redundant access to the IP Storage by using two (2) VMKernels, each active on one 10Gb NIC.

IP storage will not be significantly impacted during periods of contention, as Network I/O Control will ensure the IP Storage traffic receives ~66% of the available bandwidth.

ESXi hosts will be able to be evacuated in a timely manner for maintenance because:

1. vMotion is active on a 10Gb NIC, thus supporting the maximum of eight (8) concurrent vMotions
2. In the event of contention, worst case vMotion will receive just short of 2Gb/sec of bandwidth (~1750Mb/sec)

High availability is ensured as each vSwitch and dvSwitch has two (2) connections from physically different NICs, which connect to physically separate switches.

Hopefully you have found this example helpful. For a related example Architectural Decision, see Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage.