Example Architectural Decision – Virtual Machine Swap file location for SRM protected VMs

Problem Statement

In an environment where multiple vSphere clusters are protected by VMware Site Recovery Manager (SRM) with array-based replication, what is the best way to ensure the RTO/RPO is met or exceeded while minimizing storage replication overhead?

Assumptions

1. Additional storage will not be obtained

2. Eight (8) paths per LUN are masked/zoned

Motivation

1. Optimize underlying storage usage

2. Ensure transient files are not unnecessarily replicated

Architectural Decision

Configure the vSphere cluster swapfile policy to “Store the swapfile in the datastore specified by the host”.

Create and configure a dedicated swapfile datastore, provided by Tier 1 storage, with capacity greater than the total vRAM of the cluster itself plus that of any/all clusters using this cluster as a recovery site.
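
This decision can also be applied programmatically. Below is a minimal pyVmomi (Python) sketch under stated assumptions: the vCenter address, credentials, cluster name and datastore name are placeholders, and the dedicated, non-replicated swapfile datastore is assumed to already exist and be mounted on every host.

    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    import ssl

    # Lab-only: skip certificate verification; use valid certs in production
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                      pwd='password', sslContext=ctx)
    content = si.RetrieveContent()

    def find_obj(vimtype, name):
        """Return the first managed object of the given type with a matching name."""
        view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
        try:
            return next(o for o in view.view if o.name == name)
        finally:
            view.Destroy()

    cluster = find_obj(vim.ClusterComputeResource, 'Prod-Cluster-01')  # placeholder name
    swap_ds = find_obj(vim.Datastore, 'Swapfile-DS-01')                # placeholder name

    # Cluster-wide policy: "Store the swapfile in the datastore specified by the host"
    spec = vim.cluster.ConfigSpecEx(vmSwapPlacement='hostLocal')
    cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

    # Point every host in the cluster at the dedicated, non-replicated datastore
    for host in cluster.host:
        host.configManager.datastoreSystem.UpdateLocalSwapDatastore(datastore=swap_ds)

    Disconnect(si)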

Justification

1. Decreased storage replication between protected and recovery sites

2. Reduced impact to the underlying storage due to reduced replication

3. Reduces the used space at the recovery site

4. No impact to the ability to recover to, or fail back from, the recovery site

5. vMotion performance will not be impacted, as all hosts within a cluster share the same swapfile datastore, which is provided from the existing shared storage

6. There is minimal complexity in setting a dedicated swapfile datastore; as such, the benefits outweigh the additional complexity

7. In the event of swapping, performance will not be impacted as the swap file is on Tier 1 storage

8. There is no additional Tier 1 storage utilization, as the vswap file would alternatively be set to “Store in the same directory as the virtual machine”

9. Ensures memory (RAM) overcommitment can still be achieved, whereas setting memory reservations would reduce/eliminate this benefit of vSphere (a quick sizing example follows this list)
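
To illustrate points 8 and 9: each VM's vswap file equals its configured vRAM minus any memory reservation, so the dedicated datastore must be sized against the cluster's total unreserved vRAM. A back-of-the-envelope calculation using hypothetical figures:

    # Hypothetical sizing example: each VM's vswap file equals its
    # configured vRAM minus its memory reservation.
    vms = [
        {'vram_gb': 16, 'reservation_gb': 0},  # no reservation   -> 16 GB vswap
        {'vram_gb': 32, 'reservation_gb': 8},  # 25% reservation  -> 24 GB vswap
        {'vram_gb': 8,  'reservation_gb': 8},  # 100% reservation -> 0 GB vswap
    ]
    required_gb = sum(vm['vram_gb'] - vm['reservation_gb'] for vm in vms)
    print(f'Minimum swapfile datastore capacity: {required_gb} GB')  # 40 GB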

Implications

1. vMotion performance between clusters will be degraded, as the swap file will be moved as part of the vMotion to the destination cluster's swapfile datastore

2. One (1) datastore out of a maximum of 256 per host is used for the swapfile datastore

3. Eight (8) paths out of a maximum of 1024 per host are used for the swap file datastore

Alternatives

1. Store the swapfile in the same directory as the virtual machine

2. Set Virtual machine memory reservations of 100% to eliminate the vswap file

Related Articles

1. Site Recovery Manager Deployment Location

2. VMware Site Recovery Manager, Physical or Virtual machine?

 

Example Architectural Decision – Site Recovery Manager Server – Physical or Virtual?

Problem Statement

To ensure production vSphere environment/s can meet/exceed the required RTOs in the event of a declared site failure, what is the most suitable way to deploy VMware Site Recovery Manager: on a physical or a virtual machine?

Requirements

1. Meet/Exceed RTO requirements

2. Ensure solution is fully supported

3. SRM must be highly available, or able to be recovered rapidly, to ensure management/recovery of the virtual infrastructure

4. Where possible, reduce the CAPEX and OPEX for the solution

5. Ensure the environment can be easily maintained in BAU

Assumptions

1. Sufficient compute capacity in the Management cluster for an additional VM

2. SRM database is hosted on an SQL server

3. The vSphere cluster (ideally the Management cluster) has N+1 availability

Constraints

1. None

Motivation

1. Reduce CAPEX and OPEX

2. Reduce the complexity of BAU maintenance / upgrades

3. Reduce power / cooling / rackspace usage in datacenter

Architectural Decision

Install Site Recovery Manager on a Virtual machine

Justification

1. Ongoing datacenter costs relating to Power / Cooling and Rackspace are avoided

2. Placing Site Recovery Manager on a virtual machine ensures the application benefits from the availability, load balancing, and fault resilience capabilities provided by vSphere

3. The CAPEX of a virtual machine is lower than that of a physical system, especially when taking into consideration the network/storage connectivity the additional hardware would require if a physical server were used

4. The OPEX of a virtual machine is lower than that of a physical system due to no hardware maintenance, minimal/no additional power usage, and no cooling costs

5. Improved scalability, including the ability to dynamically add resources where increased consumption by the VM requires it. Note: The guest operating system must support Hot Add / Hot Plug, and the feature must be enabled while the VM is shut down (see the sketch following this list). Where these features are not supported, virtual hardware can be added with a short outage.

6. Improved manageability, as the VMware abstraction layer makes day-to-day tasks such as backup/recovery easier

7. Ability to non-disruptively migrate to new hardware where EVC is enabled and configured in a compatible mode between hosts within a vSphere datacenter
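
In relation to point 5, a minimal pyVmomi sketch for enabling Hot Add on the SRM VM. The VM name is a placeholder, and the vCenter connection and find_obj helper are as defined in the earlier swapfile sketch:

    from pyVmomi import vim

    # 'find_obj' and the vCenter connection are as in the earlier sketch;
    # the VM name is a placeholder.
    vm = find_obj(vim.VirtualMachine, 'SRM-Server-01')

    # The Hot Add flags can only be changed while the VM is powered off
    assert vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOff

    spec = vim.vm.ConfigSpec(cpuHotAddEnabled=True, memoryHotAddEnabled=True)
    vm.ReconfigVM_Task(spec=spec)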

Alternatives

1. Place SRM on a physical server

Implications

1. For some storage arrays, the SRM server needs access to admin LUNs, and using a virtual machine may increase complexity due to the requirement for RDMs

I would like to thank James Wirth VCDX#83 (@jimmywally81) for his contribution to this example architectural decision.

Related Articles

1. Site Recovery Manager Deployment Location

2. Swap file location for SRM protected VMs


 

 

Example Architectural Decision – Host Isolation Response for FC Based storage

Problem Statement

What are the most suitable HA / host isolation settings where the environment uses storage (IBM SVC) with FC connectivity via a dedicated, highly available Storage Area Network (SAN) fabric, and where ESXi Management and Virtual Machine traffic run over a highly available data network?

Requirements

1. Ensure in the event of one or more hosts becoming isolated, the environment responds in an automated manner to recover VMs where possible

Assumptions

1. The network is highly available (>99.999% availability)
2. The storage is highly available (>99.999% availability)
3. vSphere 5.0 or later
4. ESXi hosts are connected to the network via two physical NICs to two physically separate switches

Constraints

1. FC (Block) based storage

Motivation

1. Meet/Exceed availability requirements
2. Minimize the chance of a false positive isolation event

Architectural Decision

Turn off the default isolation address by setting the below advanced setting:

“das.usedefaultisolationaddress” = False

Configure three (3) isolation addresses by setting the below advanced settings:

“das.isolationaddress1” = 192.168.1.1 (Core Router)

“das.isolationaddress2” = 192.168.1.2 (Core Switch 1)

“das.isolationaddress3” = 192.168.1.3 (Core Switch 2)

Configure Datastore Heartbeating with “Select any of the cluster’s datastores”

Configure Host Isolation Response to “Shutdown”
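
The settings above can be applied in one reconfiguration task. A minimal pyVmomi sketch, with the cluster name as a placeholder and the connection and find_obj helper as in the earlier swapfile sketch:

    from pyVmomi import vim

    cluster = find_obj(vim.ClusterComputeResource, 'Prod-Cluster-01')  # placeholder name

    das = vim.cluster.DasConfigInfo()
    das.option = [
        vim.option.OptionValue(key='das.usedefaultisolationaddress', value='false'),
        vim.option.OptionValue(key='das.isolationaddress1', value='192.168.1.1'),  # core router
        vim.option.OptionValue(key='das.isolationaddress2', value='192.168.1.2'),  # core switch 1
        vim.option.OptionValue(key='das.isolationaddress3', value='192.168.1.3'),  # core switch 2
    ]
    # Heartbeat against any of the cluster's datastores
    das.hBDatastoreCandidatePolicy = 'allFeasibleDs'
    # Graceful guest shutdown on host isolation
    das.defaultVmSettings = vim.cluster.DasVmSettings(isolationResponse='shutdown')

    cluster.ReconfigureComputeResource_Task(
        spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)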

Justification

1. When using FC storage, it is possible for the Management and Virtual Machine networks to be unavailable while the storage network is working perfectly. In this case virtual machines may not be able to communicate with other servers, but can continue reading/writing from disk; they will likely not be servicing customer workloads. Shutting the VM down gracefully so HA can restart it on a host which is not isolated gives the VM a greater chance of resuming service than remaining on an isolated host.
2. Datastore heartbeating will allow HA to confirm whether the host is “isolated” or “failed”. In either case, shutting down the VM will allow HA to recover it on a surviving host.
3. As all storage is presented via Active/Active IBM SVC controllers, there is no benefit in specifying specific datastores to be used for heartbeating.
4. The selected isolation addresses were chosen as they are highly available devices which are essential for network communication and cover the core routing and switching components in the network.
5. In an environment where the network is highly available, an isolation event is extremely unlikely. As such, where the three (3) isolation addresses cannot be contacted, either the network cannot be restored in a timely manner or the host has suffered multiple concurrent failures (e.g. multiple network cards). Performing a controlled shutdown helps ensure that when the network is recovered the VMs are brought back up in a consistent state, or, where the isolation impacts only a subset of ESXi hosts in the cluster, that the VM/s can be recovered by HA and resume normal operations.

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address

Implications

1. In the event the host cannot reach any of the isolation addresses, virtual machines will be shut down
2. Using “Shutdown” as opposed to “Power off” ensures a graceful shutdown of the guest operating system; however, this will delay the HA restart of the VM by up to 5 minutes (300 seconds) if VMware Tools is unable to do a controlled shutdown, in which case after 300 seconds a “Power Off” will be executed (see the tuning sketch below)
3. In the unlikely event of network instability, VMs may be shut down prematurely
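
Regarding implication 2, the graceful-shutdown window is controlled by the HA advanced option das.isolationshutdowntimeout (default 300 seconds). A short tuning sketch, reusing the cluster object from the HA sketch above; the 120-second value is illustrative only:

    from pyVmomi import vim

    # 'cluster' as in the earlier HA sketch
    das = vim.cluster.DasConfigInfo()
    das.option = [vim.option.OptionValue(key='das.isolationshutdowntimeout', value='120')]
    cluster.ReconfigureComputeResource_Task(
        spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)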
