Example Architectural Decision – Site Recovery Manager Server – Physical or Virtual?

Problem Statement

To ensure Production vSphere environment/s can meet/exceed the required RTOs in the event of a declared site failure, What is the most suitable way to deploy VMware Site Recovery Manager, on a Physical or Virtual machine?

Requirements

1. Meet/Exceed RTO requirements

2. Ensure solution is fully supported

3. SRM be highly available, or be able to be recovered rapidly to ensure Management / Recovery of the Virtual infrastructure

4. Where possible, reduce the CAPEX and OPEX for the solution

5. Ensure the environment can be easily maintained in BAU

Assumptions

1. Sufficient compute capacity in the Management cluster for an additional VM

2. SRM database is hosted on an SQL server

3. vSphere Cluster (ideally Management cluster)  has N+1 availability

Constraints

1. None

Motivation

1. Reduce CAPEX and OPEX

2. Reduce the complexity of BAU maintenance / upgrades

3. Reduce power / cooling / rackspace usage in datacenter

Architectural Decision

Install Site Recovery Manager on a Virtual machine

Justification

1. Ongoing datacenter costs relating to Power / Cooling and Rackspace are avoided

2. Placing Site Recovery Management on a Virtual machine ensures the application benefits from the availability, load balancing, and fault resilience capabilities provided by vSphere

3. The CAPEX of a virtual machine is lower than a physical system especially when taking into consideration network/storage connectivity for the additional hardware where a physical server was used

4. The OPEX of a virtual machine is lower than a physical system due to no hardware maintenance, minimal/no additional power usage , and no cooling costs

3. Improved scale-ability and the ability to dynamically add additional resources (where required) assuming increased resource consumption by the VM. Note: The guest operating system must support Hot Add / Hot Plug and be enabled while the VM is shutdown. Where these features are not supported, virtual hardware can be added with a short outage.

4. Improved manageability as the VMware abstraction layer makes day to day tasks such as backup/recovery easier

5. Ability to non-disruptively migrate to new hardware where EVC is configured in compatible mode and enabled between hosts within a vSphere data center

Alternatives

1. Place SRM on a physical server

Implications

1. For some storage arrays, the SRM server needs to have access to admin LUNs and using a virtual machine may increase complexity by the requirement for RDMs

I would like to Thank James Wirth VCDX#83 (@jimmywally81) for his contribution to this example architectural decision.

Related Articles

1. Site Recovery Manager Deployment Location

2. Swap file location for SRM protected VMs

CloudXClogo

 

 

Example Architectural Decision – Number of paths per LUN for VMFS datastores

Problem Statement

In a vSphere environment hosting a large number of VMs,  Virtual machines I/O requirements range from small <100 IOPS to large business critical applications with tens of thousands of IOPS, the ESXi hosts have been configured with 4 x 8Gb FC HBAs.

What is the most suitable number of paths per LUN when using 4 x 8GB FC connections per Host, and how will they be presented in a highly available manner with two (2) SAN Fabrics connected to an Active/Active Enterprise Disk array?

Requirements

1. All LUNs are available on all FC Interfaces
2. The storage be highly available
3. The environment should be able to continue running production workloads in the unlikely event of a dual port HBA, or single Fabric failure.
4. The environment maintain a consistent level of performance

Assumptions

1. The Storage area network has two (2) fabrics each of which is highly available
2. The disk system is presented to both SAN fabrics
3. The number of VMs per host is >100
4. vSphere 4.0 or later
5. Storage array is Active/Active
6. ESXi hosts are large and are designed to drive significant I/O
7. VAAI is supported and enabled

Constraints

1. Maximum paths supported per ESXi host is 1024
2. Maximum number of datastores per ESXi host is 256

Motivation

1. Ensure optimal performance redundancy
2. Maximum the total capacity able to be presented to a cluster

Architectural Decision

Use a standard of 8 paths per LUN

Each LUN will be presented to each HBA via both Controller A and Controller B resulting in two paths per LUN per HBA.

With a total of 4 FC connections across two (2) physical dual port HBAs in a HA configuration with one (1) connection per HBA per Fabric, this equates to a total of 8 paths per LUN to the ESXi host (4 paths per Fabric)

Justification

1. This equates to 4 paths (1 per HBA interface per LUN) per Fabric
2. The use of VMware NMP with “Round Robin” will be used and having all LUNs presented via both fabrics and all HBAs will provide the maximum reducing in latency and the most consistent performance overall
3. 8 paths per LUN ensures up to 128 LUNs can be presented within the 1024 paths per ESXi host limit which will support sufficient capacity for the cluster
4. The solution is highly available as it uses two fabrics and both controllers are Active
5. In the event of a Fabric failure, the remaining Fabric serving 2 x 8Gb connections will provide connectivity to both Controller A and B, with a total of 4 paths
6. Ensures the cluster can have enough LUNs to balance workloads across which will assist keeping latency at a minimum

Alternatives

1. Have less paths per LUN which enabled the use of more LUNs
2. Have more paths per LUN and have less LUNs

Implications

1. LUN sizes will need to be sizes to ensure a maximum of 128 LUNs are sufficient from a capacity perspective to cater for the desired number of virtual machines

vmware_logo_ads

Example Architectural Decision – Guest OS Page File Storage in vSphere

Problem Statement

In a vSphere environment using deduplication and an array snapshot based backup solution, Guest OS page files are currently stored on the OS drive (VMDK) which reduces the effectiveness of deduplication as well as placing an overhead on the controllers having to scan data which cannot be deduplicated.

As the Guest OS Paging files are being included in the snapshot process (with the guest OS) this also demands additional capacity for both primary and secondary disk storage for disk to disk backups.

How can this overhead be minimized or eliminated?

Requirements

1. Make the most efficient use of the available storage capacity
2. Maintain consistent level of virtual machine / storage performance
3. Minimize the storage required for primary and secondary snapshot based backups
4. Maintain the array level snapshot based backup solution as it is required to meet RPO/RTOs
5. Maintain the use of deduplication and this has proven to decrease storage requirements and improve performance

Assumptions

1. vSphere 5.0 or later
2. VMFS 5 Datastores which are Thin Provisioned
3. Deduplication is in use for Volumes where Guest OS virtual disks are stored
4. VAAI is supported by the array and enabled across the vSphere environment
5. All datastores are presented to all hosts within the cluster
6. Snapshot based backup solution is being used
7. Virtual Machines are right sized
8. Disk to disk backup data is replicated offsite

Constraints

1. None

Motivation

1. Optimize the storage performance
2. Ensure Tier 1 storage is not wasted with transient files
3. Minimize storage required for snapshot based backups

Architectural Decision

Separate OS page files onto a dedicated VMDK, which will be located on a datastore (or datastore cluster) which is
1. Not Protected by the array level snapshot backup solution
2. Not running deduplication
3. Not running data compression

Justification

1. Allows page files to be stored on different underlying storage including (optionally) high capacity, lower cost, SATA disk
2. Relocating Guest OS page files to another datastore (or datastore cluster) not protected ny snapshots dramatically reduces the amount of Data being protected by the Snapshot based backup solution
3. Reduces the amount of data being replicated to secondary disk backup location/s thus minimizing the bandwidth requirements between datacenters
4. (Optionally) Ensures Tier 1 storage is only used for high performance guests
5. The result of the Virtual Machines being right sized the performance impact/frequency of paging should be minimal
6. Reduces the CPU cycles required for deduplication as data which cannot be deduplicated will not be scanned
7. Reduces the CPU cycles on the storage controllers by not attempting to compress page file data

Alternatives

1. Leave Page Files within the Virtual machines primary VMDK an accept the overhead on the backup solution
2. Turn of paging within the Guest OS (No Page File)

Implications

1. The additional steps of creating a dedicated VMDK for the VM and configuring the Guest OS to use the alternate location
2. Templates need to be updated to the above configuration
3. For environments using Site Recovery Manager,for protected virtual machines, some manual steps are required when setting up the virtual machines for the first time. This increases the work required during setup, however as this is a one time overhead, it is believed the benefit of reduced backup storage and replication traffic (for SRM) outweighs the one time overhead

vmware_logo_ads