Scaling Hyper-converged solutions – Compute only.

Posted on July 24, 2015 by Josh Odgers

A quick bit of history on Nutanix: back in mid-2013 when I joined, there was a common theme in almost every meeting I went to and every presentation I gave. People wanted to scale compute and storage at different rates.

Now this makes perfect sense, and this issue has long been addressed by a large range of node types which can be mixed in the same Nutanix cluster.

For example: NX-3060 nodes with dual Intel Haswell CPUs and ~2TB usable storage can be mixed with NX-6060 nodes, also running dual Intel Haswell CPUs but with ~8TB usable each.

Nutanix also has configure-to-order (CTO) nodes where the size of the SSDs and HDDs can be modified to suit customer requirements. So at this point I never have a challenge sizing for a customer workload, as I have plenty of great options to choose from.

Another common question has been “How do I scale storage only?”. Nutanix has also addressed this in an intelligent way and as a result adding “Storage Only” nodes makes sense as I described in Scale Storage separately to Compute on Nutanix!

In recent months a new question has emerged and a small percentage of partners/customers have been asking about adding Compute only nodes (e.g.: Traditional ESXi hosts) to a Nutanix (or HCI) cluster.

My first question to these customers/partners is: Why?

The typical reply is something like “Because we need to add more VMs which have low storage requirements” or “Because we don’t need storage”.

Let’s look at these answers:

Firstly, my favourite one, “Because we don’t need storage”.

Is this really true, or do you mean the new VMs have low storage requirements? In almost all cases the truth is that the new VMs have a small requirement for storage capacity and performance.

So next let’s look at the other common (and more realistic) situation:

“Because we need to add more VMs which have low storage requirements”

This is very possible and something an HCI solution should cater for, and Nutanix does. For example, among our most popular nodes are the NX-3050 and NX-3060, compute-heavy nodes each with 2 sockets and up to 24 physical CPU cores (Haswell) and 512GB RAM.

This node also comes with 2 x SSDs and 4 x SATA HDDs with a minimum usable capacity of approx 2TB (of which 20% is SSD).
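To make the capacity arithmetic concrete, here is a minimal sketch using only the approximate figures quoted above (~2TB usable per node, of which ~20% is SSD). The function name and the node counts are illustrative assumptions, and real sizing depends on the exact node configuration.

```python
# Rough arithmetic only: the per-node figures are the approximate values
# quoted in this post, not output from a sizing tool. Names are hypothetical.
USABLE_TB_PER_NODE = 2.0   # approx minimum usable capacity per node
SSD_FRACTION = 0.20        # approx share of usable capacity that is SSD


def added_capacity(node_count):
    """Return (total usable TB, SSD TB, SATA TB) added by node_count nodes."""
    total_tb = node_count * USABLE_TB_PER_NODE
    ssd_tb = total_tb * SSD_FRACTION
    sata_tb = total_tb - ssd_tb
    return total_tb, ssd_tb, sata_tb


if __name__ == "__main__":
    for nodes in (1, 2, 4):
        total, ssd, sata = added_capacity(nodes)
        print(f"{nodes} node(s): ~{total:.1f}TB usable, "
              f"~{ssd:.1f}TB SSD, ~{sata:.1f}TB SATA")
```

Even a handful of compute-heavy nodes therefore adds a modest amount of SSD to the cluster alongside the CPU and RAM.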

So while the solution adds some capacity, it’s delivering all the advantages of HCI while eliminating the complexity of a 3-tier architecture, which is why customers are flocking to HCI in the first place.

Even if the capacity is not required, the SSDs service reads locally where required and increase the shared SSD tier of the cluster, which means more write performance for workloads throughout the cluster. Sounds pretty good to me!

Does having an additional 4 x SATA drives really matter? Well, from a cost perspective it’s minimal, and thanks to Disk Balancing the SATA drives will hold some data (such as replicas), which lowers the overheads on other nodes, therefore improving resiliency and performance.

So there are lots of advantages to adding even a small amount of storage, even if the new workloads don’t require most of it.

But for those of you who aren’t already convinced that adding some storage is advantageous, how about adding dual Intel Haswell CPUs and up to 512GB RAM with just 1 x SSD, to accelerate write I/O and serve locally what little storage the VMs need, and just 2 x SATA HDDs?

Nutanix has such a node, which is another option to scale high compute and very low storage.

Another question I get is: “Is the fact Nutanix can’t do this why you don’t recommend it?”

The answer is that Nutanix can add compute only, and we can actually do it very well and get very good performance. But it’s not HCI and it adds unnecessary complexity, which is why we don’t recommend (or productise) this option.

Now let’s look at what adding compute only to HCI looks like.

*Scroll down when ready!

V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V

Yuk! That looks like old school 3-tier stuff to me!

As the above shows, adding Compute Only to HCI basically means you have a non-HCI solution for part of your workloads.

Non-HCI workloads on compute only nodes would therefore:

Be running in the same setup as traditional 3-tier infrastructure
Have different performance than HCI based workloads
Lose the advantage of having compute + storage close together
Increase dependency on the network
Impact network utilization of the HCI nodes
Impact the benefits of HCI for the native HCI workloads, and much more.

The industry has accepted HCI as the way of the future, and while adding compute only nodes might sound nice at a high level, it’s just re-introducing the classic 3-tier complexity and problems of the past.

Summary:

If you have already invested in HCI, you clearly understand the advantages and value of the solution. Adding compute only is not a true “value”, it’s just a “perceived value”.

Adding “Compute only” is just adding complexity and moving away from the value HCI brings, so my advice is: don’t make the mistake. But if you have, you now know the solution.

Invest in a compute+storage node (albeit at a higher CAPEX) and enjoy the continued value of HCI and improve performance and resiliency to your entire cluster! Now that’s real value (at a reasonable cost).

And just remember….

The Key to performance is Consistency

Posted on July 6, 2015 by Josh Odgers

In recent weeks I have been doing lots of proofs of concept and performance testing using tools such as Jetstress (with great success I might add).

What I have always told customers is to focus on choosing a solution which comfortably meets their performance requirements while also delivering consistent performance.

The key word here is consistency.

Many solutions can achieve very high peak performance especially when only testing cache performance, but this isn’t real world as I discussed in Peak Performance vs Real World Performance.

So with two Jetstress VMs on a 3 node Nutanix cluster (N+1 configuration) I configured Jetstress to create multiple databases which used about 85% of the available capacity per node. The nodes used were hybrid, meaning some SSD and some SATA drives.

What this means is the nodes have ~20% of data within the SSD tier, with the bulk of the data residing within the SATA tier, as shown on the Storage tab of the Nutanix PRISM UI.

As Jetstress performs I/O across all data concurrently, it means that things like caching and tiering become much less effective.

For this testing no tricks have been used such as de-duplicating Jetstress DBs, which are by design duplicates. Doing this would result in unrealistically high dedupe ratios where all data would be served from SSD/cache resulting in artificially high performance and low latency. That’s not how I roll, I only talk real performance numbers which customers can achieve in the real world.

In this post I am not going to talk about the actual IOPS result, the latency figures or the time it took to create the databases as I’m not interested in getting into performance bake offs. What I am going to talk about is the percentage difference in the following metrics between the nodes observed during these tests:

1. Time to create the databases : 1.73%

2. IOPS achieved : 0.44%

3. Avg Read Latency : 4.2%

As you can see the percentage difference between the nodes for these metrics is very low, meaning performance is very consistent across a Nutanix cluster.
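For readers who want to run the same kind of comparison on their own test results, here is a minimal sketch of one way to express the difference between nodes as a percentage. The post does not state which formula was used, so the definition below (spread between the highest and lowest node, relative to the mean) is an assumption, and the sample figures are made up.

```python
def percent_difference(values):
    """Spread between the highest and lowest value, as a percentage of the mean.

    This is one common definition; it is an assumption, not necessarily the
    formula behind the figures quoted in this post.
    """
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)
    if mean == 0:
        raise ValueError("mean is zero; percentage difference is undefined")
    return (max(values) - min(values)) / mean * 100.0


if __name__ == "__main__":
    # Hypothetical per-node IOPS figures, NOT results from the tests above.
    per_node_iops = [1000.0, 1004.0]
    print(f"IOPS difference: {percent_difference(per_node_iops):.2f}%")
```

The same function works for database creation time or average read latency, as long as each node is measured under the same concurrent load.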

Note: All testing was performed concurrently, and background tasks performed by the Nutanix “Curator” function, such as ILM (Tiering) and Disk Balancing, were all running during these tests.

What does this mean?

Running business critical workloads on the same Nutanix cluster does not cause any significant noisy neighbour type issues, which can and do occur in traditional centralised shared storage solutions.

VMware have attempted to mitigate this issue with technology such as Storage I/O Control (SIOC) and Storage DRS (SDRS), but these issues are natively eliminated thanks to the Nutanix scale-out, shared-nothing architecture (Nutanix Xtreme Computing Platform, or XCP).

Customers can be confident that performance achieved on one node is repeatable as Nutanix clusters are scaled even with Business Critical applications with large working sets which easily exceed the SSD tier.

It also means performance doesn’t “fall off the cache cliff” and become inconsistent, which has long been a fear with systems dependent on cache for performance.

Nutanix has chosen not to rely on caching to achieve high read/write performance. Instead, we tune our defaults for consistent performance across large working sets and to ensure data integrity, which means we commit writes to persistent media before acknowledging them and perform checksums on all read and write I/O. This is key for business critical applications such as MS SQL, MS Exchange and Oracle.
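The principle of committing a write before acknowledging it, and checksumming what is read back, can be illustrated with a generic sketch. This is not how Nutanix implements it internally; the function names are hypothetical, and CRC32 against a local file is simply a convenient stand-in for the idea.

```python
import os
import tempfile
import zlib


def durable_write(path, payload):
    """Write payload, flush it to persistent media, then return its checksum.

    Returning only after os.fsync() stands in for "acknowledge after commit".
    """
    checksum = zlib.crc32(payload)
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()              # push Python's buffer to the OS
        os.fsync(handle.fileno())   # ask the OS to commit to the device
    return checksum                 # the "acknowledgement"


def verified_read(path, expected_checksum):
    """Read the data back and fail loudly if the checksum does not match."""
    with open(path, "rb") as handle:
        payload = handle.read()
    if zlib.crc32(payload) != expected_checksum:
        raise IOError("checksum mismatch: data may be corrupt")
    return payload


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "block.bin")
        ack = durable_write(target, b"example data")
        print(verified_read(target, ack))
```

The trade-off is that the caller waits for the flush, which is exactly why consistent performance matters more than peak cache numbers for these workloads.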

Advanced Storage Performance Monitoring with Nutanix

Posted on June 25, 2015 by Josh Odgers

Nutanix provides excellent performance monitoring and analytic capabilities through our HTML 5 based PRISM UI, but what if you want to delve deeper into the performance of a specific business critical application?

Nutanix also provides advanced storage performance monitoring and workload profiling through port 2009 on any CVM which shows very granular details for Virtual disks.

By default, Nutanix secures our CVM and the http://CVM_IP:2009 page is not accessible, but for advanced troubleshooting this can be enabled by using the following command.

sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT
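Once the rule is in place, one quick way to confirm the port is reachable from your workstation is a plain TCP connection test. The sketch below uses only the Python standard library; the IP address is a placeholder from the documentation range and the function name is hypothetical.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    cvm_ip = "192.0.2.10"  # placeholder address; replace with your CVM IP
    print("port 2009 reachable:", is_port_open(cvm_ip, 2009))
```

A False result only means the TCP connection failed; it does not say whether the cause is the firewall rule, routing, or the service itself.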

When accessing the 2009 page (which is part of the Nutanix process called “Stargate”) you will see things like Extent (In Memory Read) cache usages and hits as well as much more.

On the main 2009 page you will see a section called “Hosted VDisks” (shown below) which lists all the VDisks (the equivalent of a VMDK in ESXi) currently running on that node.

The Hosted VDisks section shows high level details about each VDisk such as Outstanding Operations, capacity usage, Read/Write breakdown and how much data is in the OpLog (Persistent Write Cache).

If you need more information, you can click on the “VDisk Id” and you will get to a page titled “VDisk XXXXX Stats” where the XXXXX is the VDisk ID.

The below is some of the information which can be discovered in the VDisk Stats Page.

VDisk Working Set Size (WSS)

The working set size can be thought of as the data which you would ideally want to fit within the SSD tier of a Nutanix node, which would result in all-flash type performance.

In the below example, in the last 2mins, the VDisk had a combined (or Union) working set of 6.208GB and over the last 1hr over 111GB.

VDisk Read Source

The Read Source is simply which tier of storage is servicing the VDisk’s I/O requests. In the below example, 41% was from Extent Cache (In Memory), 7% was from the SSD Extent Store and 52% was from the SATA Extent Store.

In the above example, this was an Exchange 2013 workload where the total dataset was approx 5x the size of the SSD tier. The important point here is that it’s not always possible to have all data in the SSD tier, but it’s critical to ensure consistent performance. If 90% was being served from SATA and performance was not acceptable, you could use this information to select a better node to migrate (vMotion) the VM to, or to help decide whether to purchase a new node.
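If you record the read source figures over time, a check like the one described above is easy to script. The sketch below is a minimal illustration: the dictionary keys are hypothetical labels rather than the field names on the 2009 page, and the 90% threshold is taken from the hypothetical in the previous paragraph.

```python
def read_source_shares(reads_by_source):
    """Convert raw read counts per source into percentages of all reads."""
    total = sum(reads_by_source.values())
    if total == 0:
        raise ValueError("no reads recorded")
    return {source: count / total * 100.0
            for source, count in reads_by_source.items()}


def needs_attention(shares, slow_source="sata_extent_store", threshold_pct=90.0):
    """Flag a VDisk whose share of reads from the slow tier reaches the threshold."""
    return shares.get(slow_source, 0.0) >= threshold_pct


if __name__ == "__main__":
    # Counts chosen to match the 41% / 7% / 52% split in the example above.
    reads = {"extent_cache": 41, "ssd_extent_store": 7, "sata_extent_store": 52}
    shares = read_source_shares(reads)
    for source, pct in shares.items():
        print(f"{source}: {pct:.0f}%")
    print("needs attention:", needs_attention(shares))
```

A flag from this check is only a prompt to look closer; whether performance is actually acceptable still has to be judged against the application's requirements.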

VDisk Write Destination

The Write Destination is fairly self-explanatory. If it’s OpLog, it means the I/O is random and is being written to SSD. If it’s going straight to the extent store (SSD), it means the I/O is either sequential or, in rare cases, the OpLog is being bypassed because the SSD tier has reached 95% full (which is generally prevented by the Nutanix ILM tiering process).
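The rule of thumb in the paragraph above can be written down as a tiny decision function. This is a deliberately simplified model of that description only; the real logic inside Stargate is more involved and is not covered in this post, and the function and constant names are hypothetical.

```python
OPLOG_BYPASS_SSD_FULL_PCT = 95.0  # the "95% full" figure quoted above


def write_destination(is_random, ssd_tier_used_pct):
    """Return "oplog" or "extent_store" for a write, per the description above."""
    if ssd_tier_used_pct >= OPLOG_BYPASS_SSD_FULL_PCT:
        return "extent_store"   # rare case: OpLog bypassed, SSD tier nearly full
    if is_random:
        return "oplog"          # random I/O lands in the OpLog on SSD
    return "extent_store"       # sequential I/O goes straight to the extent store


if __name__ == "__main__":
    print(write_destination(is_random=True, ssd_tier_used_pct=60.0))
    print(write_destination(is_random=False, ssd_tier_used_pct=60.0))
    print(write_destination(is_random=True, ssd_tier_used_pct=96.0))
```

Reading the Write Destination stats is the reverse exercise: a high OpLog share suggests a random write pattern, while a high extent store share suggests sequential writes.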

VDisk Write Size Distribution

The Write Size Distribution is key to determining things like the Windows Allocation Size when formatting drives as well as understanding the workload.

VDisk Read Size Distribution

The Read Size Distribution is similar to Write Size in that it’s key to determining things like the Windows Allocation Size when formatting drives, as well as understanding the workload. In this case, a 64k allocation size would be ideal, as both the Write (shown above) and the Read (below) are >32K and <64K 86% of the time (which is expected, as this was an Exchange 2013 workload).
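The reasoning above (pick the allocation size that covers the bulk of the I/O) can be sketched as a simple heuristic over the size distribution. The function name and the bucket layout are assumptions for illustration, and only the 32K-64K bucket at 86% comes from this post; the other percentages are made up.

```python
def suggest_allocation_kb(size_histogram):
    """Pick the upper bound (KB) of the bucket holding the largest share of I/O.

    size_histogram: list of (lower_kb, upper_kb, percent_of_io) tuples.
    This is a rough heuristic, not an official sizing rule.
    """
    if not size_histogram:
        raise ValueError("empty histogram")
    _, upper_kb, _ = max(size_histogram, key=lambda bucket: bucket[2])
    return upper_kb


if __name__ == "__main__":
    # Only the 32K-64K / 86% bucket is from the post; the rest is hypothetical.
    histogram = [(0, 4, 2.0), (4, 32, 9.0), (32, 64, 86.0), (64, 1024, 3.0)]
    print(f"suggested allocation size: {suggest_allocation_kb(histogram)}K")
```

With the 86% bucket dominating, the heuristic returns 64, which matches the 64k allocation size suggested above. It should be applied to both the read and write distributions before formatting a drive.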

VDisk Write Latency

The Write Latency shows the percentage of write I/Os serviced within each of the latency ranges shown. In this case, 52% of writes are sub-millisecond. It also shows that for this vDisk 1% of I/Os are outliers, served in 5-10ms. Outside of a lab, if the outliers were a significant percentage, this could be investigated to ensure the VM disk configuration (e.g.: PVSCSI and number of VMDKs) is optimal.

VDisk Ops and Randomness

Here we see the number of IOPS, the Read/Write split, MB/s and the split between Random and Sequential.

Summary

For any enterprise grade storage solution, it is important that performance monitoring is easy, as it is with Nutanix via the PRISM UI, but also that you can quickly and easily dive deep into very granular details about a specific VM or VDisk. The above shows just a glimpse of the information which is tracked by default for all VDisks, allowing customers, partners and Nutanix support to quickly and easily monitor & profile workloads.

Importantly these capabilities are hypervisor agnostic giving customers the same capabilities no matter what choice/s they make.
