Rule of Thumb: Sizing for Storage Performance in the new world

In the new world, where storage performance is decoupled from capacity by read/write caching and Hyper-Converged solutions, I always get asked:

How do I size the caching or Hyper-Converged solution to ensure I get the storage performance I need?

Obviously I work for Nutanix, so this question comes from prospective or existing Nutanix customers, but it's also relevant to other products in the market, such as PernixData or any Hybrid (SSD+SAS/SATA) solution.

So for indicative sizing (i.e. presales), where definitive information is not available and/or you cannot conduct a detailed assessment, I use the following simple Rule of Thumb.

Take your last two monthly full backups, calculate the delta between them and multiply it by 3.

So if my full backup from August was 10TB and my full backup from September is 11TB, my delta is 1TB. I then multiply that by 3 to get 3TB, which is our assumption of the "Active Working Set" or, in basic terms, the data which needs performance (cold or inactive data can sit on any tier without causing performance issues).

Now I size my SSD tier for 3TB of usable capacity.
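To make the arithmetic explicit, here is a minimal Python sketch of the rule of thumb. The function name and the assumption that backup sizes are expressed in TB are mine, purely for illustration; it simply takes the month-on-month delta and multiplies it by 3.

```python
def active_working_set_tb(previous_full_tb: float, latest_full_tb: float,
                          multiplier: float = 3.0) -> float:
    """Estimate the Active Working Set (TB) from two monthly full backups.

    Rule of thumb: (latest full backup - previous full backup) x multiplier.
    """
    delta_tb = latest_full_tb - previous_full_tb
    if delta_tb < 0:
        # Backups shrank (e.g. heavy deletions); the rule of thumb does not apply.
        raise ValueError("backup delta is negative; review more months of backups")
    return delta_tb * multiplier


# August full backup = 10TB, September full backup = 11TB
ssd_usable_tb = active_working_set_tb(10.0, 11.0)
print(f"Size the SSD tier for {ssd_usable_tb:.1f}TB usable")
```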

The next question is:

Why multiply the backup data delta by 3?

This is based on an assumption (since we don’t have any hard data to go on) that the Read/Write ratio is 70% Read, 30% write.

Now, those of you familiar with this thing called Maths would argue that 70/30 is 2.333, which is true, so rounding up to 3 is essentially a buffer.
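If you have measured (or want to test) a different Read/Write ratio, the multiplier can be derived the same way: divide read by write and round up. The helper below is hypothetical, a sketch of the reasoning above rather than a formal sizing method.

```python
import math


def multiplier_from_ratio(read_pct: float, write_pct: float) -> int:
    """Sizing multiplier: read/write ratio, rounded up to a whole number."""
    if write_pct <= 0:
        raise ValueError("write percentage must be greater than zero")
    ratio = read_pct / write_pct      # 70 / 30 = 2.333...
    return math.ceil(ratio)           # rounding up provides the buffer


print(f"70/30 ratio: {70 / 30:.3f}")                     # 2.333
print(f"Multiplier:  {multiplier_from_ratio(70, 30)}")   # 3
```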

I have found this rule of thumb works very well, and customers I have worked with have effectively had All Flash Array performance because the “Active Working Set” all resides within the SSD tier.

Caveats to this rule of thumb.

1. If a customer does a significant amount of deletions during the month, the delta may be smaller and result in an undersized SSD tier.

Mitigation: Review several months of full backup logs and average the delta.

2. If the environment's Read/Write ratio is much higher than 70/30 (i.e. more read-heavy), then the backup delta multiplied by 3 may again result in an undersized SSD tier.

Mitigation: Perform some investigation into your most critical workloads and validate or correct the assumption of multiplying by 3.

3. This rule of thumb is for Server workloads, not VDI.

The VDI Read/Write ratio is generally almost the opposite of server workloads, at around 30/70 Read/Write. However, the SSD tier for VDI should be sized taking into account the benefits of VAAI/VCAI cloning and features such as deduplication (for memory and SSD tiers), which some products, like Nutanix, offer.
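The mitigations for the first two caveats can be combined in one calculation: average the delta across several months of full backups, and derive the multiplier from a measured Read/Write ratio rather than the assumed 70/30. The sketch below assumes a list of monthly full backup sizes in TB (the example figures are hypothetical) and, per caveat 3, applies to server workloads only.

```python
import math
from statistics import mean


def averaged_working_set_tb(monthly_full_backups_tb: list[float],
                            read_pct: float = 70.0,
                            write_pct: float = 30.0) -> float:
    """Server workloads only: mean month-on-month delta x rounded-up R/W ratio."""
    if len(monthly_full_backups_tb) < 2:
        raise ValueError("need at least two monthly full backups")
    # Caveat 1: average several deltas so one month of heavy deletions
    # does not dominate the estimate.
    deltas = [later - earlier for earlier, later
              in zip(monthly_full_backups_tb, monthly_full_backups_tb[1:])]
    # Caveat 2: use a measured read/write ratio instead of the 70/30 assumption.
    multiplier = math.ceil(read_pct / write_pct)
    return mean(deltas) * multiplier


# Hypothetical monthly full backup sizes (TB), oldest first.
backups = [10.0, 11.0, 11.5, 13.0]
print(averaged_working_set_tb(backups))           # assumed 70/30
print(averaged_working_set_tb(backups, 80, 20))   # hypothetical measured 80/20
```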

Summary / Disclaimer

This rule of thumb works for me 90% of the time when designing Nutanix solutions, but your results may vary depending on the platform you use.

I welcome any feedback or suggestions of alternative sizing strategies, and I will update the post with them where appropriate.

Is VAAI beneficial with Virtual Storage Appliance (VSA) based solutions?

I saw a tweet recently (below) which inspired me to write this post as there is still a clear misunderstanding of the benefits VAAI provides (even with Virtual Storage Appliances).

[Screenshot: the original tweet]

I have removed the identity of the individual who wrote the tweet, and of the people who retweeted it, as the goal of this post is solely to correct what I believe is misinformation.

My interpretation of the tweet was (and remains) that if a solution uses a Virtual Storage Appliance (VSA) which resides on the ESXi host, then VAAI does not provide any benefits.

My opinion on this topic is:

Compared to a traditional centralised NAS (such as a NetApp or EMC Isilon) providing NFS storage with VAAI-NAS support, a Nutanix or VSA solution has exactly the same benefits from VAAI!

My 1st reply to the tweet was:

[Screenshot: my first reply]

The test I was referring to with NetApp ONTAP Edge can be found here. It was posted in Jan 2013, well before I joined Nutanix, while I was working for IBM, where I had been evangelising VAAI/VCAI-based solutions for a long time because VAAI/VCAI provides significant value to VMware customers.

The following shows the person's initial reply to my tweet.

[Screenshot: the person's initial reply]

I responded as shown below, mentioning that I would write a blog post, which is what you're reading now.

I went on to provide some brief replies, as shown below.

[Screenshot: my further replies]

I would summarize (rightly or wrongly) the main comments from this person's tweets as follows:

  • VAAI is designed only to offload functions externally (or off the ESXi host)
  • He/She had not seen any proof of performance advantages from VAAI on VSAs
  • It's broken logic to use VAAI with a VSA

Firstly, I would like to comment on VAAI being designed to offload functions externally (or off the ESXi host). I don't disagree that VAAI has some functions designed to offload to the (centralised) array, but VAAI also has numerous functions which are designed to bring other efficiencies to a vSphere environment.

An example of a feature designed to offload to a central array is the “XCOPY” primitive.

A simple example of what "XCOPY" (Extended Copy) provides is offloading a Storage vMotion on block-based storage (i.e. VMFS over iSCSI, FC or FCoE, not NFS) to the array, so the ESXi host does not have to process the data movement.

This VAAI primitive would likely be of little benefit in a VSA environment where the storage presented is block-based and, for example, Storage DRS is used. The data movement would be offloaded from ESXi to the VSA running on ESXi, so the host would still be burdened with the SvMotion.

However, XCOPY is only one of the many VAAI primitives, and VAAI does a lot more than just offload Storage vMotions.

For the purpose of this post, I will be discussing VAAI with Nutanix, whose software-defined storage solution runs in a VM on every ESXi host in a Nutanix cluster.
Note: This information is also relevant to other VSAs which support VAAI-NAS.

So what benefit does VAAI provide to Nutanix or a VSA solution running NFS?

Nutanix deploys by default with NFS and supports the VAAI-NAS primitives which are:

  • Full File Clone
  • Fast File Clone
  • Reserve Space
  • Extended Statistics

Note: XCOPY is not supported on NFS. Importantly, and specifically for Nutanix, it is not required, as SvMotion will rarely, if ever, be used with Nutanix solutions.

See my post “Storage DRS and Nutanix – To use, or not to use, that is the question?” for more details on why SvMotion is rarely needed when using Nutanix.

For more details of VAAI primitives, Cormac Hogan (@CormacJHogan) wrote an excellent post which can be found here.

Here is an example of the significant performance benefits of VAAI with Nutanix.

Let's look at cloning a VM on a Nutanix platform; the VM's details are below.

[Screenshot: Admin01 VM details]

The VM I have used for this test resides on a datastore called "Management" (as per the above image), which is presented via NFS and has VAAI (Hardware Acceleration) enabled, as shown below.

[Screenshot: Management datastore details]

If I do a simple clone of a VM (as shown below) while the VM is powered on, VAAI-NAS is bypassed, as the "Fast File Clone" primitive only works on VMs which are powered off.

[Screenshot: the clone operation]

So a simple way to test the performance benefits of VAAI on any platform (including Hyper-converged such as Nutanix, a Virtual Storage Appliance (VSA) such as NetApp ONTAP Edge, or traditional centralised SAN or NAS) is to clone a VM while it is powered on, then shut down the VM and clone it again.

I performed this test: the first clone, with the VM powered on, started at 1:17:23 PM and finished at 1:26:12 PM, a total of 8 minutes 49 seconds.

Next, I shut down the VM and repeated the clone operation.

[Screenshot: results of the second clone]

As we can see in the above screen capture, the 2nd clone started at 1:26:49 PM and finished at 1:26:54 PM, a total of 5 seconds.
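For anyone repeating this test, the elapsed times can be calculated directly from the task start and finish timestamps. A small Python sketch, assuming the vSphere Client shows times in the 12-hour "h:mm:ss AM/PM" format used above (the function name is hypothetical):

```python
from datetime import datetime


def clone_duration_seconds(start: str, finish: str) -> int:
    """Elapsed seconds between two task timestamps such as '1:17:23 PM'."""
    fmt = "%I:%M:%S %p"
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


powered_on = clone_duration_seconds("1:17:23 PM", "1:26:12 PM")   # VAAI-NAS bypassed
powered_off = clone_duration_seconds("1:26:49 PM", "1:26:54 PM")  # Fast File Clone
print(f"Powered on:  {powered_on // 60}m {powered_on % 60}s")
print(f"Powered off: {powered_off}s")
print(f"Speed-up:    ~{powered_on / powered_off:.0f}x")
```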

The reason for the huge difference in speed between the two clones is that the VAAI-NAS "Fast File Clone" primitive offloaded the 2nd clone to the Nutanix platform (which runs as a VM on the ESXi host), which intelligently cloned the VM using metadata, resulting in almost zero data creation. In contrast, the 1st clone did not use VAAI-NAS, so the hypervisor and storage solution had to read 11.18GB of data (the source VM, Admin01) and write a full copy of the same data, effectively resulting in >22GB of data movement in the environment.
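To illustrate why a metadata-based clone creates almost no data, here is a deliberately simplified copy-on-write model. This is a generic sketch of the technique, not how the Nutanix Distributed File System is actually implemented, and the class and function names are hypothetical. A virtual disk is modelled as a list of block ids: a full clone reads and rewrites every block, while a metadata clone copies only the list, and later writes to the clone allocate new blocks.

```python
from dataclasses import dataclass, field


@dataclass
class BlockStore:
    """Toy physical layer: block id -> data, counting bytes actually written."""
    blocks: dict[int, bytes] = field(default_factory=dict)
    bytes_written: int = 0

    def write_block(self, data: bytes) -> int:
        block_id = len(self.blocks)          # blocks are never deleted in this toy
        self.blocks[block_id] = data
        self.bytes_written += len(data)
        return block_id


@dataclass
class VirtualDisk:
    """A virtual disk is only metadata: an ordered list of block ids."""
    block_map: list[int]


def full_clone(store: BlockStore, src: VirtualDisk) -> VirtualDisk:
    # No offload: read every source block and write a physical copy of it.
    return VirtualDisk([store.write_block(store.blocks[b]) for b in src.block_map])


def metadata_clone(src: VirtualDisk) -> VirtualDisk:
    # Offloaded clone: copy the block map only; data blocks stay shared.
    return VirtualDisk(list(src.block_map))


def write(store: BlockStore, disk: VirtualDisk, index: int, data: bytes) -> None:
    # Copy-on-write: new data goes to a new block, shared blocks are untouched.
    disk.block_map[index] = store.write_block(data)


store = BlockStore()
source = VirtualDisk([store.write_block(b"x" * 4096) for _ in range(100)])
baseline = store.bytes_written

clone = metadata_clone(source)
print("metadata clone wrote", store.bytes_written - baseline, "bytes")
full_clone(store, source)
print("full clone wrote", store.bytes_written - baseline, "bytes")
write(store, clone, 0, b"y" * 4096)
print("source block 0 unchanged:", store.blocks[source.block_map[0]] == b"x" * 4096)
```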

Now from a capacity savings perspective, a simple way to demonstrate the capacity savings of VAAI on any platform is to clone a VM multiple times and compare the before and after datastore statistics.

Before I performed this test I captured a baseline of the Management datastore as shown below.

[Screenshot: Management datastore summary before cloning]

The above highlighted areas show:

  • Virtual Machines and Templates as 83
  • Capacity 8.49TB
  • Provisioned Space 7.09TB
  • Free Space 7.01TB

I then cloned the Admin01 VM a total of 7 times.

[Screenshot: Recent Tasks showing the 7 clone operations]

Immediately after the last clone completed, I took the below screenshot of the Management datastore's statistics.

[Screenshot: Management datastore summary after cloning]

The above highlighted areas in the updated datastore summary show:

  • Virtual Machines and Templates INCREASED by 7 to 90 (as I cloned 7 VMs)
  • Capacity remained the same at 8.49TB
  • Provisioned Space INCREASED to 7.29TB as we cloned 7 x ~40GB VMs (Total of ~280GB)
  • Free Space REMAINED THE SAME at 7.01TB due to VAAI-NAS Fast File Clone primitive working with the Nutanix Distributed File System.

So VAAI-NAS allowed a VM with ~11GB of used storage (~40GB provisioned) to be cloned without using any significant additional disk space, and each clone completed in between 5 and 7 seconds.
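The before/after comparison can also be expressed as a simple calculation over the datastore summary figures listed above. The DatastoreSummary type is hypothetical and only used for illustration; the point is that the VM count rises by 7 while free space does not change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatastoreSummary:
    vm_count: int          # "Virtual Machines and Templates"
    capacity_tb: float
    provisioned_tb: float
    free_tb: float


# Figures from the Management datastore summary before and after the 7 clones.
before = DatastoreSummary(vm_count=83, capacity_tb=8.49, provisioned_tb=7.09, free_tb=7.01)
after = DatastoreSummary(vm_count=90, capacity_tb=8.49, provisioned_tb=7.29, free_tb=7.01)

vms_added = after.vm_count - before.vm_count
free_space_used_tb = round(before.free_tb - after.free_tb, 2)

print(f"VMs added: {vms_added}")
print(f"Free space consumed: {free_space_used_tb}TB")
```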

So some of the benefits VAAI-NAS provides to Nutanix (which some people would term as a VSA type solution) include:

  • Near instant VM cloning via vSphere Client/s (as shown above)
  • Near instant Horizon View Linked Clone deployments (VCAI) – Similar to example shown.
  • Near instant vCloud Director clones (via FAST Provisioning) – Similar to example shown.
  • Major capacity savings by using Intelligent cloning rather than Full Clones (As shown above)
  • Lower CPU overhead for both ESXi hosts AND Nutanix Controller VM (CVM)
  • Ability to create EagerZeroThick VMDKs on NFS (e.g. to support Fault Tolerance & clustered workloads such as Oracle RAC)
  • Enhanced ability to get statistics on file sizes, capacity usage, etc. on NFS

In Summary:

Overall, I would say that VMware have developed an excellent API in VAAI, and VAAI support from Nutanix and other VSA providers delivers major advantages and value to our joint customers with VMware.

It would be broken logic NOT to leverage the advantages of VAAI regardless of storage type (VSA, Nutanix or traditional centralized SAN/NAS) and for the vast majority of vSphere deployments, any storage solution not supporting (or having issues/bugs with) VAAI will have significant downsides.

I am looking forward to ongoing developments from VMware such as vVols and VASA 2.0 to continue to enhance storage of vSphere solutions in the future.

I hope customers and architects now have the correct information to make the most effective design and purchasing recommendations to meet/exceed customer requirements.

Storage DRS and Nutanix – To use, or not to use, that is the question?

Storage DRS (SDRS) is an excellent feature which was released with vSphere 5.0 in late 2011. For those of you who are not familiar with SDRS I recommend reading the following article prior to reading the rest of this post as SDRS knowledge is assumed from now on.

Understanding VMware vSphere 5.1 Storage DRS

This post also assumes basic knowledge of the Nutanix platform; for those of you who are not familiar with Nutanix, please review the following links prior to reading the remainder of this post.

About Nutanix | How Nutanix Works | 8 Strategies for a Modern Datacenter


With Storage DRS (SDRS), both capacity and performance can be managed, but what should SDRS manage in a Nutanix environment?

Let's start with performance. SDRS can help ensure optimal performance of virtual machines by enabling the I/O metric for SDRS recommendations, as shown in the screenshot below.

[Screenshot: SDRS settings with the I/O metric option circled]

Once this is done, SDRS will evaluate I/O every 8 hours (by default) and where the configured latency threshold is exceeded, perform a cost/benefit analysis before deciding to make a migration recommendation or do nothing.
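As a rough illustration of the first step of that evaluation, the sketch below flags datastores whose observed latency exceeds the configured threshold. It is a simplified, hypothetical model, not VMware's SDRS algorithm: the 15 ms threshold and the datastore names and latencies are assumptions for the example, and the cost/benefit analysis SDRS performs before recommending a migration is not modelled.

```python
from dataclasses import dataclass

EVALUATION_INTERVAL_HOURS = 8     # default I/O evaluation interval, per the text above
LATENCY_THRESHOLD_MS = 15.0       # assumed threshold for this example


@dataclass
class DatastoreLatency:
    name: str
    observed_latency_ms: float    # latency observed over the evaluation period


def over_threshold(datastores: list[DatastoreLatency],
                   threshold_ms: float = LATENCY_THRESHOLD_MS) -> list[str]:
    """Datastores that would move on to SDRS's cost/benefit analysis.

    The cost/benefit step (which may still decide to do nothing) is not modelled.
    """
    return [ds.name for ds in datastores if ds.observed_latency_ms > threshold_ms]


# Hypothetical datastore cluster.
cluster = [DatastoreLatency("DS01", 6.5), DatastoreLatency("DS02", 21.0)]
print(f"Every {EVALUATION_INTERVAL_HOURS}h, candidates: {over_threshold(cluster)}")
```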

So the question is, does SDRS add value in a Nutanix environment from a performance perspective?

The Nutanix solution adopts the “Scale-out” methodology by having one (1) Nutanix Controller VM (CVM) per Nutanix Node (ESXi Host) and then presents NFS datastore/s to the vSphere cluster which are serviced by all CVMs. The CVMs use intelligent auto-tiering to ensure optimal performance. The way this works at a high level, is as follows.

Data is written to an SSD tier (either PCIe SSD such as Fusion-io, or SATA SSD) before being migrated off to a SATA tier once the blocks are determined to be "Cold", and, if/when required, promoted back to the SSD tier when they become "Hot" again for improved read performance.
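The tiering behaviour described above can be sketched as a toy two-tier model: writes land on SSD, blocks with no accesses during an evaluation window are treated as cold and moved to SATA, and a SATA block that is read repeatedly is promoted back. This is a generic illustration with a made-up promotion threshold and fixed time windows, not the actual Nutanix tiering algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    promote_after_reads: int = 3                         # assumption, not a Nutanix value
    ssd: set[int] = field(default_factory=set)
    sata: set[int] = field(default_factory=set)
    hits: dict[int, int] = field(default_factory=dict)   # accesses in current window

    def write(self, block: int) -> None:
        # All writes land on the SSD tier first.
        self.sata.discard(block)
        self.ssd.add(block)
        self.hits[block] = self.hits.get(block, 0) + 1

    def read(self, block: int) -> str:
        # Return the tier that served the read, promoting hot SATA blocks.
        self.hits[block] = self.hits.get(block, 0) + 1
        if block in self.ssd:
            return "ssd"
        if self.hits[block] >= self.promote_after_reads:
            self.sata.discard(block)
            self.ssd.add(block)
        return "sata"

    def end_of_window(self) -> None:
        # Blocks on SSD with no accesses this window are "cold": demote them.
        for block in [b for b in self.ssd if self.hits.get(b, 0) == 0]:
            self.ssd.discard(block)
            self.sata.add(block)
        self.hits = {b: 0 for b in self.hits}


store = TieredStore()
store.write(1)
store.write(2)
store.end_of_window()        # window 1: both blocks were accessed, both stay on SSD
store.read(1)
store.end_of_window()        # window 2: block 2 was idle, so it is demoted to SATA
print(sorted(store.ssd), sorted(store.sata))
print([store.read(2) for _ in range(3)])  # served from SATA until promoted
print(sorted(store.ssd))
```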

As with other vendors' storage solutions with auto-tiering technologies (such as FAST-VP, FlashPools, etc.), the same recommendation around SDRS and the I/O metric is true for Nutanix: leave it disabled.

So, having concluded that the I/O metric will be "Disabled", let's move on to capacity management.

The Nutanix solution presents large NFS datastore/s to the ESXi hosts (Nutanix nodes) which are shared across all ESXi hosts in one or more vSphere clusters.

When using SDRS, it can manage initial placement of a new Virtual machine based on the configured “Utilized Space” metric (shown below) to ensure there is not a capacity imbalance between the datastores in a datastore cluster, as well as move virtual machines around when new machines are provisioned to ensure the balance is maintained.

[Screenshot: SDRS Utilized Space setting]

So this is a really good feature which I have recommended, and still do recommend, in several scenarios. However, the Nutanix solution typically presents a small number of large NFS datastores to the vSphere cluster (or clusters), which are serviced by all Controller VMs (CVMs) in the Nutanix cluster. Using SDRS for initial placement does not add much (if any) value, as the initial placement will almost always be on the same large NFS datastore.
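A simplified model of capacity-based initial placement shows why it adds little value with a single large datastore. The greedy "lowest utilization after placement" rule below is an assumption for illustration only (real SDRS also applies thresholds and cost/benefit logic), and the datastore names and sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Datastore:
    name: str
    capacity_tb: float
    used_tb: float


def initial_placement(datastores: list[Datastore], vm_size_tb: float) -> str:
    """Pick the datastore with the lowest utilization after placing the VM."""
    candidates = [ds for ds in datastores if ds.used_tb + vm_size_tb <= ds.capacity_tb]
    if not candidates:
        raise ValueError("no datastore has enough free space")
    best = min(candidates, key=lambda ds: (ds.used_tb + vm_size_tb) / ds.capacity_tb)
    return best.name


# Traditional array: several smaller datastores, so placement choice matters.
traditional = [Datastore("LUN01", 2.0, 1.6), Datastore("LUN02", 2.0, 1.1),
               Datastore("LUN03", 2.0, 1.4)]
# Nutanix-style: one large NFS datastore, so the answer never changes.
nutanix = [Datastore("NTNX-NFS01", 30.0, 12.0)]

print(initial_placement(traditional, 0.04))
print(initial_placement(nutanix, 0.04))
```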

Where actual physical capacity becomes an issue, space saving technologies such as compression can be enabled, or the environment can be granularly scaled by adding just a single additional Nutanix node which linearly scales the solution from both a capacity and performance perspective.

The only real choice arises when you present two (or more) datastores, where one datastore leverages the Nutanix compression technology. In this scenario, it is very easy for a vSphere admin to choose the placement of a VM, and it takes the same amount of administrative effort as choosing a datastore cluster, which would be a collection of datastores either using compression or not, depending on the workloads.

As a result there is no advantage to using SDRS to manage utilized space.

In conclusion, Storage DRS is a great feature when used with storage arrays where performance does not scale linearly or provide intelligent tiering to address I/O bottlenecks and/or where your environment has large numbers of datastores where you need to actively manage capacity.

As performance and capacity are intelligently managed natively by the Nutanix solution, the requirement for (and benefit of) SDRS is negated, so there is no reason to use SDRS with a Nutanix solution.

Related Articles

1. Example Architectural Decision – VMware DRS automation level for a Nutanix environment