NOS 4.5 Delivers Increased Read Performance from SATA

In a recent post I discussed how NOS 4.5 increases the effective SSD tier capacity by performing up-migrations on only the local extent as opposed to both RF copies within the Nutanix cluster. In addition to this significant improvement in usable SSD tier, in NOS 4.5 the read performance from the SATA tier has also received lots of attention from Nutanix engineers.

What the Solutions and Performance Engineering team have discovered and been testing is how we can improve SATA performance. Now ideally the active working set for VMs will fit within the SSD tier, and the changes discussed in my previous post dramatically improve the chances of that active working set fitting within the SSD tier.

But there are situation when reads to cold data still need to be serviced by the slow SATA drives. Nutanix uses Data Locality to ensure the hot data remains close to the application to deliver the lowest latency and overheads which improve performance, but in the case of SATA drives and the fact data is infrequently accessed from SATA means that reading from remote SATA drives can improve performance especially where the number of local SATA drives is limited (in some cases to only 2 or 4 drives).

Most Nutanix nodes have 2 x SSD and 4 x SATA so best case you will only see a few hundred IOPS from SATA as that is all they are physically capable of. To get around this issue.

NOS 4.5 introduces some changes to the way in which we select a replica to read an egroup from the HDD tier. Periodically NOS (re)calculate the average IO latencies of the all the replicas of a vdisk’s (replicas which have the vdisk’s egroups). We use this information to choose a replica as follows:

  1. If the latency of the local replica is less than a configurable threshold, read from the local replica.
  2. If the latency of the local replica is more than a configurable threshold, and the latency of the remote replica is more than that of the local replica, prefer the local replica.
  3. If the latency of the local replica is more than a configurable threshold and the remote replica is lower than the configurable threshold OR lower than the local copy, prefer the remote replica.

The diagram below shows an example of where the VM on Node A is performing random reads to data A and shortly thereafter data C. When requesting reads from data A the latency is below the threshold but when it requests data C, NOS detects that the latency of the local copy is higher than the remote copy and selects the remote replica to read from. As the below diagram shows, one possible outcome when reading multiple pieces of data is one read is served locally and the other is serviced remotely.

remotesatareads2

Now the obvious next question is “What about Data Locality”.

Data Locality is being maintained for the hot data which resides in SSD tier because reads from SSD are faster and have lower overheads on CPU/Network etc when read locally due to the speed of SSDs. For SATA reads which are typical >5ms the SATA drive itself is the bottleneck not the network, so by distributing the Reads across more SATA drives even if they are not local, results in better overall performance and lower latency.

Now if the SSD tier has not reached 75% all data will be within the SSD tier and will be served locally, the above feature is for situations where the SSD tier is 75% full and data is being tiered to SATA tier AND random reads are occurring to cold data OR data which will not fit in the SSD tier such as very large databases.

In addition NOS 4.5 detects if the read I/O is random or sequential, and if its sequential (which SATA performance much better at) then the up-migration of data has a higher threshold to meet before being migrated to SSD.

The result of these algorithm improvements (and the increased SSD tier effective capacity discussed earlier) and Nutanix In-line compression is higher performance over larger working sets which also exceed the capacity of the SSD tier.

Effectively NOS 4.5 is delivering a truly scale out solution for read I/O from SATA tier which means one VM can be reading from potentially all nodes in the cluster ensuring SATA performance for things like Business Critical Applications is both high and consistent. Combine that with NX-6035C storage only nodes, this means SATA read I/O can be scaled out as shown in the below diagram without scaling compute.

ScaleOutRemoteReads

 

As we can see above, the Storage only Nodes (NX-6035C) are delivering additional performance for read I/O from the SATA tier (as well as from the SSD tier).

NOS 4.5 Delivers Increased effective SSD tier capacity

In addition to the increased effective SSD (and SATA) tier capacity gained by using Erasure Coding (EC-X) which was announced at the Nutanix .NEXT conference earlier this year, the upcoming NOS (Nutanix Operating System) 4.5 is providing a yet another effective capacity increase for the SSD tier.

Here’s how it works:

The below 4 node cluster has 3 VMs actively using data (known as extents) represented by the A,B,C blocks. This is a very simplified example as VMs will have potentially hundreds or thousands of extents distributed throughout a cluster.

AllHotDataSSD

What we can see in the above diagram is two copies of each piece of data as this is an RF2 deployment. The VM on Node A is using extent A, the VM on Node B is using extent B and the VM on Node C is using extent C.

Because the VMs are using Extents A,B and C, they all remain within the SSD tier including the replicas distributed throughout the cluster. When these extents become cold they will be dynamically moved to the SATA tier.

What is changing in NOS 4.5 is the Nutanix tiering solution called ILM (Intelligent Lifecycle Management) now perform up-migrations (from SATA to SSD) on a per extent basis which means replicas are treated independent of each other. What this means is the hot extents will up-migrate to SSD on the node where the VM is running (via Data Locality) giving all flash performance while the replicas distributed throughout the cluster will remain in the SATA tier as shown below:

PerExtentUpMigrations

As we can see in the above diagram, all copies of A,B,C and D were in the SATA tier. Then the VM on node A started frequently reading from data A and the local extent is therefore up-migrate to SSD.

For the VM on node B, it started frequently accessing data D and B. Data D was up-migrated from local SATA and data B was up-migrated AND localized as it was residing on a remote node. The VM on node C also up-migrated from local SATA the same as VM on node A.

Now we can see that out of the 8 extents, we have 4 which have me up-migrated and localized (where required) and 4 which remain in the low cost SATA tier.

As a result the SSD tiers effective capacity is doubled for RF2 and tripled for RF3. So this means for customers using RF2, the active working set can potentially double while still providing all flash performance.

If data is frequently being overwritten NDFS will detect this and up-migrate both the local and remote copy/copies to ensure write I/O is always serviced by the SSD tier. The below diagram shows Data A being up-migrated to node C SSD tier ready to service the redundant replicas for any write I/O.

PerExtentUpMigrationsWriteIO

As typical mixed workload environments have a higher Read vs Write ratio e.g.: 70/30 the benefits of only up-migrating one extent when it becomes hot is effective for a large percentage of the I/O.

Even in the event the Read vs Write Ratio is reversed e.g.: 30/70 which is typical for VDI environments, the new ILM process will still provide a significant effective increase of the SSD tier by only up-migrating one out of two extents. It should be noted for VDI solutions, VAAI-NAS already provides huge data reduction savings thanks to intelligent cloning and as a result it is not uncommon to find large VDI deployments on Nutanix using only the SSD tier.

Summary:

NOS 4.5 delivers Double or Triple (for RF3) the effective SSD tier capacity in addition to data reduction savings from technologies such as deduplication, compression and Erasure Coding (EC-X). This feature is like most things with Nutanix is hypervisor agnostic!

Not bad for a free software upgrade huh!

Related Posts:

1. Scaling Hyper-converged solutions – Compute only.

2. Advanced Storage Performance Monitoring with Nutanix

3. Nutanix – Improving Resiliency of Large Clusters with Erasure Coding (EC-X)

4. Nutanix – Erasure Coding (EC-X) Deep Dive

5. Acropolis: VM High Availability (HA)

6. Acropolis: Scalability

7. NOS & Hypervisor Upgrade Resiliency in PRISM

Scaling Hyper-converged solutions – Compute only.

A quick bit of history on Nutanix, back in mid 2013 when I joined, in almost every meeting I went to, and presentation I gave, there was a common theme. People wanted to scale compute and storage at different rates.

Now this makes perfect sense, and this issue has long been addressed by a large range of node types which can be mixed in the same Nutanix cluster.

For example: NX3060 nodes with Dual Intel Haswell CPUs and ~2TB usable storage can be mixed with NX6060 nodes also running dual Intel Haswell CPUs but with ~8TB usable each.

Nutanix also has configure to order (CTO) nodes where size of SSDs and HDDs can be modified to suit customer requirements. So at this point I never have a challenge sizing for a customer workload as I have plenty of great options to choose from.

Another common question has been “How do I scale storage only?”. Nutanix has also addressed this in an intelligent way and as a result adding “Storage Only” nodes makes sense as I described in Scale Storage separately to Compute on Nutanix!

In recent months a new question has emerged and a small percentage of partners/customers have been asking about adding Compute only nodes (e.g.: Traditional ESXi hosts) to a Nutanix (or HCI) cluster.

My first question to these customers/partners is: Why?

The typical reply is something like “Because we need to add more VMs which have low storage requirements” or “Because we don’t need storage”.

Let’s look at these answers:

Firstly, my favourite one, “Because we don’t need storage”.

Is this really true, or do you mean the new VMs have low storage requirements. In almost all cases the truth is the new VMs have a small requirement for storage capacity and performance.

So next let’s look at the other common (and more realistic) situation:

“Because we need to add more VMs which have low storage requirements”

So this is very possible and something a HCI solution should cater for and for Nutanix we do. For example one of our most popular nodes is the NX-3050 or NX-3060 which are a compute heavy node with 2 sockets each with up to 24 physical CPU cores (Haswell) and 512GB RAM.

This node also comes with 2 x SSDs and 4 x SATA HDDs with a minimum usable capacity of approx 2TB (of which 20% is SSD).

So while the solution adds some capacity, its giving the added advantage of ensuring all the advantages of HCI while eliminating the complexity of a 3-tier architecture, which is why customers are flocking to HCI in the 1st place.

Even if the capacity is not required and the SSDs simply service the reads locally where required and increase the shared SSD tier of the cluster which means more write performance for workloads throughout the cluster. Sounds pretty good to me!

Does having an additional 4 x SATA drives really matter? Well from a cost perspective, its minimal cost and thanks to Disk Balancing, the SATA drives will hold some data (such as replicas) which lowers the overheads on other nodes, therefore improving resiliency and performance.

So there is lots of advantages to adding even a small amount of storage even if the new workloads don’t require most of it.

But for those of you who aren’t already convinced that adding some storage is advantageous, how about adding dual Intel Haswell CPUs and up to 512GB RAM just 1 x SSD to accelerate write I/O and serve what little storage locally that the VMs need and just 2 x SATA HDDs.

Nutanix has such a node, which is another option to scale high compute and very low storage.

Another question I get is: “Is the fact Nutanix can’t do this why you don’t recommend it?”

The answer is, Nutanix can add compute only, and we can actually do it very well and get very good performance, but its not HCI and it adds complexity which is not necessary which is why we don’t recommend (or Productise) this option.

Now let’s look at what adding compute only to HCI looks like?

warning-contents-may-offend_design-200x200 (1)

*Scroll down when ready!

V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V

HCInotHCI

 

Yuk! That looks like old school 3-tier stuff to me!

As the above shows, adding Compute Only to HCI basically means you have a non HCI solution for part of your workloads.

Non HCI workloads on compute only nodes would therefore:

  • Be running in the same setup as traditional 3-tier infrastructure
  • Have different performance than HCI based workloads
  • Loose the advantage of having compute + storage close together
  • Increase dependency on Network
  • Impact network utilization of HCI node
  • Impact benefits of HCI for the native HCI workloads and much more.

The industry has accepted HCI as they way of the future and while adding compute only nodes might sound nice at a high level, its just re-introducing the class 3-tier complexity and problems of the past.

Summary:

If you have already invested in HCI, you clearly understand the advantages and value of the solution. Adding compute only is not a true “value” its just a “perceived value”.

Adding “Compute only” is just adding complexity and moving away from the value HCI brings, so my advice, don’t make the mistake, but if you have, you now know the solution.

Invest in a compute+storage node (albeit at a higher CAPEX) and enjoy the continued value of HCI and improve performance and resiliency to your entire cluster! Now that’s real value (at a reasonable cost).

And just remember….

cheaper

Related Posts:

1. Acropolis Hypervisor (AHV) I/O Failover & Load Balancing

2. Advanced Storage Performance Monitoring with Nutanix

3. Nutanix – Improving Resiliency of Large Clusters with Erasure Coding (EC-X)

4. Nutanix – Erasure Coding (EC-X) Deep Dive

5. Acropolis: VM High Availability (HA)

6. Acropolis: Scalability

7. NOS & Hypervisor Upgrade Resiliency in PRISM