Storage DRS and Nutanix – To use, or not to use, that is the question?

Storage DRS (SDRS) is an excellent feature which was released with vSphere 5.0 in late 2011. For those of you who are not familiar with SDRS, I recommend reading the following article before continuing, as SDRS knowledge is assumed from here on.

Understanding VMware vSphere 5.1 Storage DRS

This post also assumes basic knowledge of the Nutanix platform. For those of you who are not familiar with Nutanix, please review the following links before reading the remainder of this post.

About Nutanix | How Nutanix Works | 8 Strategies for a Modern Datacenter


With Storage DRS (SDRS), both capacity and performance can be managed, but what should SDRS manage in a Nutanix environment?

Let's start with performance. With the I/O metric enabled for SDRS recommendations (as shown in the screenshot below), SDRS can help ensure optimal performance of virtual machines.

[Screenshot: SDRS settings with the I/O metric option circled]

Once this is done, SDRS will evaluate I/O every 8 hours (by default) and, where the configured latency threshold is exceeded, perform a cost/benefit analysis before deciding whether to make a migration recommendation or do nothing.
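The decision described above can be sketched in a few lines of Python. This is only an illustration of the two-step logic (threshold check, then cost/benefit), not how SDRS is actually implemented. The names DatastoreStats and should_recommend_migration, the abstract benefit and cost scores, and the 15 ms threshold used in the example are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DatastoreStats:
    """Hypothetical summary of one datastore over an evaluation period."""
    name: str
    observed_latency_ms: float


def should_recommend_migration(stats, latency_threshold_ms,
                               estimated_benefit, estimated_cost):
    """Return True only if latency exceeds the threshold AND the
    estimated benefit of moving a VM outweighs the estimated cost."""
    # Step 1: below (or at) the threshold, nothing is considered at all.
    if stats.observed_latency_ms <= latency_threshold_ms:
        return False
    # Step 2: above the threshold, a move still has to pay for itself.
    return estimated_benefit > estimated_cost


# Example values are illustrative only.
busy = DatastoreStats(name="datastore-01", observed_latency_ms=22.0)
print(should_recommend_migration(busy, 15.0,
                                 estimated_benefit=8.0, estimated_cost=3.0))
```

Note that a datastore over the threshold still produces no recommendation when the move would cost more than it gains, which is the "do nothing" outcome.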

So the question is, does SDRS add value in a Nutanix environment from a performance perspective?

The Nutanix solution adopts the “Scale-out” methodology by having one (1) Nutanix Controller VM (CVM) per Nutanix node (ESXi host), and presents one or more NFS datastores to the vSphere cluster, which are serviced by all CVMs. The CVMs use intelligent auto-tiering to ensure optimal performance. At a high level, this works as follows.

Data is written to an SSD tier (either PCIe SSD such as Fusion-io, or SATA SSD) before being migrated to a SATA tier once the blocks are determined to be “Cold”. If and when required, blocks are promoted back to the SSD tier when they become “Hot” again, for improved read performance.

As with other vendors' storage solutions that use auto-tiering technologies (such as FAST-VP, FlashPools etc.), the same recommendation around SDRS and the I/O metric holds true for Nutanix: leave it disabled.

So, at this point we have concluded the I/O metric will be “Disabled”. Let's move on to capacity management.

The Nutanix solution presents large NFS datastore/s to the ESXi hosts (Nutanix nodes) which are shared across all ESXi hosts in one or more vSphere clusters.

SDRS can manage the initial placement of a new virtual machine based on the configured “Utilized Space” metric (shown below) to ensure there is no capacity imbalance between the datastores in a datastore cluster, and can move virtual machines around when new machines are provisioned to keep that balance.

[Screenshot: SDRS “Utilized Space” threshold setting]
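To make the idea of space-based initial placement concrete, here is a minimal Python sketch. It is a simplified model, not the actual SDRS algorithm: it places a new VM on whichever datastore would end up least full, and skips any datastore that would exceed the utilized space threshold. The function name pick_datastore, the datastore names and sizes, and the 80% threshold are all assumptions made for illustration.

```python
def pick_datastore(datastores, vm_size_gb, utilized_space_threshold=0.80):
    """Pick a datastore for a new VM based on utilized space.

    datastores: dict mapping name -> (capacity_gb, used_gb).
    Returns the name of the datastore that would be least utilized
    after placement, or None if every datastore would exceed the
    threshold.
    """
    candidates = []
    for name, (capacity_gb, used_gb) in datastores.items():
        utilization_after = (used_gb + vm_size_gb) / capacity_gb
        if utilization_after <= utilized_space_threshold:
            candidates.append((utilization_after, name))
    if not candidates:
        return None
    # Lowest resulting utilization wins.
    return min(candidates)[1]


# Hypothetical datastore cluster with two members.
cluster = {
    "datastore-a": (1000, 700),
    "datastore-b": (1000, 400),
}
print(pick_datastore(cluster, vm_size_gb=100))
```

With only one datastore in the cluster, the function has nothing to choose between and simply returns that datastore as long as it has room.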

So this is a really good feature, which I have recommended and continue to recommend in several scenarios. However, the Nutanix solution typically presents a small number of large NFS datastores to the vSphere cluster (or clusters), which are serviced by all Controller VMs (CVMs) in the Nutanix cluster. Using SDRS for initial placement does not add much (if any) value, as the initial placement will almost always be on the same large NFS datastore.

Where actual physical capacity becomes an issue, space saving technologies such as compression can be enabled, or the environment can be granularly scaled by adding just a single additional Nutanix node which linearly scales the solution from both a capacity and performance perspective.

The only real choice arises when you present two (or more) datastores, where one datastore leverages the Nutanix compression technology. In this scenario it is very easy for a vSphere admin to choose the placement of a VM, and it takes the same administrative effort as choosing a datastore cluster, which would be a collection of datastores either using compression or not, depending on the workloads.

As a result there is no advantage to using SDRS to manage utilized space.

In conclusion, Storage DRS is a great feature when used with storage arrays where performance does not scale linearly or which do not provide intelligent tiering to address I/O bottlenecks, and/or where your environment has large numbers of datastores whose capacity you need to actively manage.

As performance and capacity are intelligently managed natively by the Nutanix solution, there is no requirement for, or benefit from, using SDRS with a Nutanix solution.

Related Articles

1. Example Architectural Decision – VMware DRS automation level for a Nutanix environment

 

 

Example Architectural Decision – ESXi Host Hardware Sizing (Example 1)

Problem Statement

What are the most suitable hardware specifications for this environment's ESXi hosts?

Requirements

1. Support Virtual Machines of up to 16 vCPUs and 256GB RAM
2. Achieve up to 400% CPU overcommitment
3. Achieve up to 150% RAM overcommitment
4. Ensure cluster performance is both consistent & maximized
5. Support IP based storage (NFS & iSCSI)
6. The average VM size is 1vCPU / 4GB RAM
7. Cluster must support approx 1000 average size Virtual machines day 1
8. The solution should be scalable beyond 1000 VMs (Future-Proofing)
9. N+2 redundancy

Assumptions

1. vSphere 5.0 or later
2. vSphere Enterprise Plus licensing (to support Network I/O Control)
3. VMs range from Business Critical Applications (BCAs) to non-critical servers
4. Software licensing for applications being hosted in the environment is based on per vCPU OR per host, where DRS “Must” rules can be used to isolate VMs to licensed ESXi hosts

Constraints

1. None

Motivation

1. Create a Scalable solution
2. Ensure high performance
3. Minimize HA overhead
4. Maximize flexibility

Architectural Decision

Use Two Socket Servers w/ >= 8 cores per socket with HT support (16 physical cores / 32 logical cores), 256GB RAM, 2 x 10Gb NICs

Justification

1. Two socket 8 core (or greater) CPUs with Hyper-Threading will provide flexibility for CPU scheduling of large numbers of VMs with diverse vCPU sizes to minimize CPU Ready (contention)

2. Using Two Socket servers of the proposed specification will support the required 1000 average sized VMs with 18 hosts with 11% reserved for HA to meet the required N+2 redundancy.

3. A cluster size of 18 hosts will deliver excellent cluster (DRS) efficiency / flexibility with minimal overhead for HA (Only 11%) thus ensuring cluster performance is both consistent & maximized.

4. The cluster can be expanded with up to 14 more hosts (to the 32 host cluster limit) in the event the average VM size is greater than anticipated or the customer experiences growth

5. Having 2 x 10Gb connections should comfortably support the IP storage, vMotion, FT and network data traffic with minimal possibility of contention. In the event of contention, Network I/O Control will be configured to minimize any impact (see Example VMware vNetworking Design w/ 2 x 10GB NICs)

6. RAM is one of the most common bottlenecks in a virtual environment. With 16 physical cores and 256GB RAM, each host has 16GB of RAM per physical core. For the average sized VM (1vCPU / 4GB RAM), this meets the CPU overcommitment target (up to 400%, i.e. four 4GB VMs per core) with no RAM overcommitment, minimizing the chance of RAM becoming the bottleneck

7. In the event of a host failure, the number of Virtual machines impacted will be up to 64 (based on the assumed average size VM) which is minimal when compared to a Four Socket ESXi host which would see 128 VMs impacted by a single host outage

8. If using Four Socket ESXi hosts, the cluster size would be approx 10 hosts and 20% of cluster resources would have to be reserved for HA to meet the N+2 redundancy requirement. This cluster size is less efficient from a DRS perspective, and the HA overhead would equate to higher CapEx and, as a result, a lower ROI

9. The solution supports Virtual machines of up to 16 vCPUs and 256GB RAM although this size VM would be discouraged in favour of a scale out approach (where possible)

10. The cluster aligns with a virtualization friendly “Scale out” methodology

11. Using smaller hosts (either single socket, or fewer cores per socket) would not meet the requirement to support Virtual machines of up to 16 vCPUs and 256GB RAM, would likely require multiple clusters, and would require additional 10Gb and 1Gb cabling as compared to the Two Socket configuration

12. The two socket configuration allows the cluster to be scaled (expanded) at a very granular level (if required), reducing CapEx and minimizing the wasted/unused cluster capacity that adding larger hosts would introduce

13. Features such as Distributed Power Management (DPM) are more attractive and lower risk to enable in larger clusters, and may result in lower environmental costs (i.e. power and cooling)
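The host counts and HA percentages quoted in the justification can be reproduced with some simple arithmetic. The Python sketch below is one way to lay that out; the function name size_cluster and its parameters are hypothetical, and it assumes that 400% CPU overcommitment means four vCPUs per physical core, that RAM is not overcommitted (as per point 6), and that N+2 is met by adding two spare hosts.

```python
import math


def size_cluster(vm_count, vcpu_per_vm, ram_gb_per_vm,
                 cores_per_host, ram_gb_per_host,
                 cpu_overcommit=4.0, ram_overcommit=1.0,
                 ha_spare_hosts=2):
    """Estimate hosts required for a cluster of average-sized VMs."""
    # VMs one host can carry, limited by whichever resource runs out first.
    vms_by_cpu = math.floor(cores_per_host * cpu_overcommit / vcpu_per_vm)
    vms_by_ram = math.floor(ram_gb_per_host * ram_overcommit / ram_gb_per_vm)
    vms_per_host = min(vms_by_cpu, vms_by_ram)

    # Hosts needed to run the workload, plus spares for N+2.
    active_hosts = math.ceil(vm_count / vms_per_host)
    total_hosts = active_hosts + ha_spare_hosts

    return {
        "vms_per_host": vms_per_host,
        "total_hosts": total_hosts,
        "ha_overhead_pct": round(100 * ha_spare_hosts / total_hosts),
    }


# Two socket hosts: 16 physical cores, 256GB RAM.
print(size_cluster(1000, 1, 4, cores_per_host=16, ram_gb_per_host=256))
# Four socket hosts: 32 physical cores, 512GB RAM.
print(size_cluster(1000, 1, 4, cores_per_host=32, ram_gb_per_host=512))
```

Under these assumptions the two socket option works out to 64 VMs per host and 18 hosts with roughly 11% reserved for HA, while the four socket option works out to 128 VMs per host and 10 hosts with 20% reserved, matching the figures used above.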

Alternatives

1. Use Four Socket Servers w/ >= 8 cores per socket, 512GB RAM, 4 x 10Gb NICs
2. Use Single Socket Servers w/ >= 8 cores, 128GB RAM, 2 x 10Gb NICs
3. Use Two Socket Servers w/ >= 8 cores, 512GB RAM, 2 x 10Gb NICs
4. Use Two Socket Servers w/ >= 8 cores, 384GB RAM, 2 x 10Gb NICs
5. Have two clusters of 9 hosts with the recommended hardware specifications

Implications

1. Additional IP addresses for ESXi Management, vMotion, FT & Out of band management will be required as compared to a solution using larger hosts

2. Additional out of band management cabling will be required as compared to a solution using larger hosts

Related Articles

1. Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage (4 x 10 GB NICs)

2. Example VMware vNetworking Design w/ 2 x 10GB NICs

3. Network I/O Control Shares/Limits for ESXi Host using IP Storage

4. VMware Clusters – Scale up for Scale out?

5. Jumbo Frames for IP Storage (Do not use Jumbo Frames)

6. Jumbo Frames for IP Storage (Use Jumbo Frames)


 

Storage DRS Configuration – Architectural Decision making flowchart

I was speaking to a number of people recently, who were trying to come up with a one size fits all Storage DRS configuration for a reference architecture document.

As Storage DRS is a reasonably complicated feature, it was my opinion that a one size fits all would not be suitable, and that multiple examples should be provided when writing a reference architecture.

A colleague suggested a flowchart would assist in making the right decision around Storage DRS, so I took up the challenge to put one together.

Below is version 0.1 of my flowchart, which I thought I would post in the hope of getting some good feedback from the community and creating a good guide for those who may not have the in-depth knowledge or experience to choose what should, in most cases, be an appropriate configuration for SDRS.

This also complements some of my previous example architectural decisions, which are listed in the Related Articles section below.

As always, feedback is welcome.

I hope you find this helpful.

* Updated to include the previously missing “NO” option for Data replication.

SDRS flowchart V0.2

Related Articles

1. Example Architectural Decision – Storage DRS configuration for NFS datastores

2. Example Architectural Decision – Storage DRS configuration for VMFS datastores