NOS 4.5 Delivers Increased Read Performance from SATA

In a recent post I discussed how NOS 4.5 increases the effective SSD tier capacity by performing up-migrations on only the local extent as opposed to both RF copies within the Nutanix cluster. In addition to this significant improvement in usable SSD tier, in NOS 4.5 the read performance from the SATA tier has also received lots of attention from Nutanix engineers.

The Solutions and Performance Engineering team has been investigating and testing how we can improve SATA performance. Ideally the active working set for VMs will fit within the SSD tier, and the changes discussed in my previous post dramatically improve the chances of that happening.

But there are situations when reads of cold data still need to be serviced by the slow SATA drives. Nutanix uses Data Locality to keep hot data close to the application, which delivers the lowest latency and overheads and so improves performance. Data on SATA, however, is infrequently accessed and the drives themselves are slow, so reading from remote SATA drives can improve performance, especially where the number of local SATA drives is limited (in some cases to only 2 or 4 drives).

Most Nutanix nodes have 2 x SSD and 4 x SATA, so best case you will only see a few hundred IOPS from SATA as that is all the drives are physically capable of.

To get around this issue, NOS 4.5 introduces some changes to the way in which we select a replica to read an egroup from the HDD tier. Periodically NOS (re)calculates the average I/O latencies of all the replicas of a vdisk (that is, the replicas which hold the vdisk’s egroups). We use this information to choose a replica as follows:

  1. If the latency of the local replica is less than a configurable threshold, read from the local replica.
  2. If the latency of the local replica is more than a configurable threshold, and the latency of the remote replica is more than that of the local replica, prefer the local replica.
  3. If the latency of the local replica is more than a configurable threshold and the latency of the remote replica is lower than the configurable threshold OR lower than that of the local replica, prefer the remote replica.
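
The three rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic as described, not the actual NOS code: the `Replica` type, the `choose_replica` function and the 5ms threshold in the usage example are all names and values I have made up, and the sketch assumes a single remote replica (RF2).

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node: str              # hypothetical node name, e.g. "A"
    avg_latency_ms: float  # periodically (re)calculated average I/O latency


def choose_replica(local: Replica, remote: Replica, threshold_ms: float) -> Replica:
    """Pick the replica to read from, following the three rules in the text."""
    # Rule 1: the local replica is below the threshold, so read locally.
    if local.avg_latency_ms < threshold_ms:
        return local
    # Rule 3: the local replica is slow and the remote replica is either
    # below the threshold or faster than the local one, so prefer remote.
    if (remote.avg_latency_ms < threshold_ms
            or remote.avg_latency_ms < local.avg_latency_ms):
        return remote
    # Rule 2: the local replica is slow but the remote one is no better.
    return local


# Hypothetical latencies with an assumed 5ms threshold.
print(choose_replica(Replica("A", 3.0), Replica("B", 8.0), 5.0).node)
print(choose_replica(Replica("A", 12.0), Replica("B", 4.0), 5.0).node)
```

The rules do not say what happens when a latency is exactly equal to the threshold; the sketch treats that case as "not below the threshold".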

The diagram below shows an example of where the VM on Node A is performing random reads to data A and shortly thereafter data C. When requesting reads from data A the latency is below the threshold but when it requests data C, NOS detects that the latency of the local copy is higher than the remote copy and selects the remote replica to read from. As the below diagram shows, one possible outcome when reading multiple pieces of data is one read is served locally and the other is serviced remotely.

[Diagram: remotesatareads2]

Now the obvious next question is “What about Data Locality?”

Data Locality is being maintained for the hot data which resides in the SSD tier, because SSDs are fast enough that reading locally gives lower latency and lower overheads on CPU/network etc. For SATA reads, which typically take >5ms, the SATA drive itself is the bottleneck, not the network, so distributing the reads across more SATA drives, even if they are not local, results in better overall performance and lower latency.

Now if the SSD tier has not reached 75% full, all data will be within the SSD tier and will be served locally. The above feature is for situations where the SSD tier is 75% full and data is being tiered to the SATA tier AND random reads are occurring to cold data OR to data which will not fit in the SSD tier, such as very large databases.

In addition, NOS 4.5 detects whether the read I/O is random or sequential, and if it’s sequential (which SATA performs much better at) then the data has a higher threshold to meet before being up-migrated to SSD.
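
To make the idea of a higher up-migration threshold for sequential reads more concrete, here is a rough sketch. How NOS actually classifies I/O and what thresholds it uses are not covered in this post, so the contiguity check, the function names and both threshold values below are purely my assumptions.

```python
# Hypothetical thresholds: number of accesses before data is up-migrated to SSD.
RANDOM_READ_THRESHOLD = 3
SEQUENTIAL_READ_THRESHOLD = 10


def is_sequential(offsets, io_size, min_ratio=0.8):
    """Treat a read stream as sequential when most reads start exactly
    where the previous read ended (an assumed, simplistic heuristic)."""
    if len(offsets) < 2:
        return False
    contiguous = sum(
        1 for prev, cur in zip(offsets, offsets[1:]) if cur == prev + io_size
    )
    return contiguous / (len(offsets) - 1) >= min_ratio


def should_up_migrate(access_count, offsets, io_size):
    """Sequential streams must clear a higher bar before moving to SSD."""
    if is_sequential(offsets, io_size):
        threshold = SEQUENTIAL_READ_THRESHOLD
    else:
        threshold = RANDOM_READ_THRESHOLD
    return access_count >= threshold


sequential_reads = [0, 4096, 8192, 12288]
random_reads = [0, 40960, 8192, 1048576]
print(should_up_migrate(5, sequential_reads, 4096))
print(should_up_migrate(5, random_reads, 4096))
```

With the same number of accesses, the random stream qualifies for up-migration while the sequential stream stays on SATA, which is where sequential reads perform comparatively well anyway.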

The result of these algorithm improvements, combined with the increased effective SSD tier capacity discussed earlier and Nutanix in-line compression, is higher performance over larger working sets, including those which exceed the capacity of the SSD tier.

Effectively NOS 4.5 is delivering a truly scale out solution for read I/O from the SATA tier, which means one VM can potentially be reading from all nodes in the cluster, ensuring SATA performance for things like Business Critical Applications is both high and consistent. Combine that with NX-6035C storage only nodes and SATA read I/O can be scaled out without scaling compute, as shown in the below diagram.

[Diagram: ScaleOutRemoteReads]

As we can see above, the Storage only Nodes (NX-6035C) are delivering additional performance for read I/O from the SATA tier (as well as from the SSD tier).
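
A quick back-of-the-envelope calculation shows why spreading reads over more spindles matters. The numbers here are assumptions for illustration only: roughly 100 random read IOPS per SATA drive, 4 SATA drives per node (including the storage only nodes), and a perfectly even spread of reads across the cluster. Treat the result as a theoretical ceiling, not a measured figure.

```python
def sata_read_iops_ceiling(nodes, sata_drives_per_node=4, iops_per_drive=100):
    """Upper bound on random read IOPS if reads are spread evenly over
    every SATA drive in the given number of nodes (assumed values)."""
    return nodes * sata_drives_per_node * iops_per_drive


scenarios = [
    ("local node only", 1),
    ("4 node cluster", 4),
    ("4 nodes plus 2 storage only nodes", 6),
]
for label, nodes in scenarios:
    print(f"{label}: ~{sata_read_iops_ceiling(nodes)} IOPS")
```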

NOS 4.5 Delivers Increased effective SSD tier capacity

In addition to the increased effective SSD (and SATA) tier capacity gained by using Erasure Coding (EC-X), which was announced at the Nutanix .NEXT conference earlier this year, the upcoming NOS (Nutanix Operating System) 4.5 is providing yet another effective capacity increase for the SSD tier.

Here’s how it works:

The below 4 node cluster has 3 VMs actively using data (known as extents) represented by the A,B,C blocks. This is a very simplified example as VMs will have potentially hundreds or thousands of extents distributed throughout a cluster.

[Diagram: AllHotDataSSD]

What we can see in the above diagram is two copies of each piece of data as this is an RF2 deployment. The VM on Node A is using extent A, the VM on Node B is using extent B and the VM on Node C is using extent C.

Because the VMs are using Extents A,B and C, they all remain within the SSD tier including the replicas distributed throughout the cluster. When these extents become cold they will be dynamically moved to the SATA tier.

What is changing in NOS 4.5 is that the Nutanix tiering solution called ILM (Intelligent Lifecycle Management) now performs up-migrations (from SATA to SSD) on a per extent basis, which means replicas are treated independently of each other. What this means is the hot extents will up-migrate to SSD on the node where the VM is running (via Data Locality), giving all flash performance, while the replicas distributed throughout the cluster will remain in the SATA tier as shown below:

[Diagram: PerExtentUpMigrations]

As we can see in the above diagram, all copies of A,B,C and D were in the SATA tier. Then the VM on node A started frequently reading from data A and the local extent was therefore up-migrated to SSD.

The VM on node B started frequently accessing data D and B. Data D was up-migrated from local SATA and data B was up-migrated AND localized as it was residing on a remote node. The VM on node C also up-migrated its data from local SATA, the same as the VM on node A.

Now we can see that out of the 8 extents, we have 4 which have been up-migrated and localized (where required) and 4 which remain in the low cost SATA tier.

As a result the SSD tier’s effective capacity is doubled for RF2 and tripled for RF3. So this means for customers using RF2, the active working set can potentially double while still providing all flash performance.
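
The doubling and tripling falls straight out of the arithmetic. As a simple sketch (the function name and the 1600GB of raw SSD are hypothetical, and it assumes all hot data is read-hot so only one copy needs to live in SSD):

```python
def effective_ssd_capacity_gb(raw_ssd_gb, rf, per_extent_up_migration):
    """Unique hot data that fits in the SSD tier.

    Without per extent up-migration every one of the RF copies of a hot
    extent occupies SSD; with it, only the local copy does.
    """
    copies_in_ssd = 1 if per_extent_up_migration else rf
    return raw_ssd_gb / copies_in_ssd


raw = 1600  # hypothetical raw SSD capacity across the cluster, in GB
for rf in (2, 3):
    before = effective_ssd_capacity_gb(raw, rf, per_extent_up_migration=False)
    after = effective_ssd_capacity_gb(raw, rf, per_extent_up_migration=True)
    print(f"RF{rf}: {before:.0f}GB -> {after:.0f}GB ({after / before:.0f}x)")
```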

If data is frequently being overwritten NDFS will detect this and up-migrate both the local and remote copy/copies to ensure write I/O is always serviced by the SSD tier. The below diagram shows Data A being up-migrated to node C SSD tier ready to service the redundant replicas for any write I/O.

[Diagram: PerExtentUpMigrationsWriteIO]
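
Putting the read and write cases together, the placement decision can be sketched roughly as follows. This is my simplification of the behaviour described above, not ILM's real logic; `ssd_copies` and its parameters are made-up names, and how "hot" is measured is left out entirely.

```python
def ssd_copies(vm_node, replica_nodes, read_hot, write_hot):
    """Return the set of nodes that should hold the extent in SSD."""
    if write_hot:
        # Frequently overwritten data: up-migrate every copy so that all
        # replica writes land on SSD.
        return set(replica_nodes)
    if read_hot:
        # Frequently read data: only the copy on the VM's own node goes to
        # SSD (localized first if it lived elsewhere); remote replicas
        # stay in the SATA tier.
        return {vm_node}
    # Cold data stays in the SATA tier.
    return set()


# Data A read by the VM on node A, with replicas on nodes A and C.
print(ssd_copies("A", {"A", "C"}, read_hot=True, write_hot=False))
# The same data once it is being frequently overwritten.
print(sorted(ssd_copies("A", {"A", "C"}, read_hot=True, write_hot=True)))
```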

As typical mixed workload environments have a higher Read vs Write ratio (e.g. 70/30), only up-migrating one extent when it becomes hot is effective for a large percentage of the I/O.

Even in the event the Read vs Write ratio is reversed (e.g. 30/70), which is typical for VDI environments, the new ILM process will still provide a significant effective increase of the SSD tier by only up-migrating one out of two extents. It should be noted that for VDI solutions, VAAI-NAS already provides huge data reduction savings thanks to intelligent cloning, and as a result it is not uncommon to find large VDI deployments on Nutanix using only the SSD tier.

Summary:

NOS 4.5 delivers Double or Triple (for RF3) the effective SSD tier capacity in addition to data reduction savings from technologies such as deduplication, compression and Erasure Coding (EC-X). This feature, like most things with Nutanix, is hypervisor agnostic!

Not bad for a free software upgrade huh!

The Key to performance is Consistency

In recent weeks I have been doing lots of proofs of concept and performance testing using tools such as Jetstress (with great success I might add).

What I have always told customers is to focus on choosing a solution which comfortably meets their performance requirements while also delivering consistent performance.

The key word here is consistency.

Many solutions can achieve very high peak performance especially when only testing cache performance, but this isn’t real world as I discussed in Peak Performance vs Real World Performance.

So with two Jetstress VMs on a 3 node Nutanix cluster (N+1 configuration) I configured Jetstress to create multiple databases which used about 85% of the available capacity per node. The nodes used were hybrid, meaning some SSD and some SATA drives.

What this means is the nodes have ~20% of data within the SSD tier and the bulk of the data residing within the SATA tier, as shown on the Storage tab of the Nutanix PRISM UI below.

[Diagram: Tierusage]

As Jetstress performs I/O across all data concurrently, it means that things like caching and tiering become much less effective.

For this testing no tricks have been used such as de-duplicating Jetstress DBs, which are by design duplicates. Doing this would result in unrealistically high dedupe ratios where all data would be served from SSD/cache resulting in artificially high performance and low latency. That’s not how I roll, I only talk real performance numbers which customers can achieve in the real world.

In this post I am not going to talk about the actual IOPS result, the latency figures or the time it took to create the databases as I’m not interested in getting into performance bake offs. What I am going to talk about is the percentage difference in the following metrics between the nodes observed during these tests:

1. Time to create the databases : 1.73%

2. IOPS achieved : 0.44%

3. Avg Read Latency : 4.2%

As you can see the percentage difference between the nodes for these metrics is very low, meaning performance is very consistent across a Nutanix cluster.
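
If you want to run the same comparison on your own results, one common way to express the gap between two nodes is the difference relative to their average. I have not stated above exactly which formula was used, so take this as one reasonable definition rather than the method behind the figures; the input values below are made up.

```python
def percent_difference(a, b):
    """Absolute difference between two measurements, as a percentage of
    their mean (one of several possible definitions)."""
    mean = (a + b) / 2
    return abs(a - b) / mean * 100


# Made-up IOPS results from two nodes, purely to show the calculation.
node1_iops, node2_iops = 1000, 1010
print(f"{percent_difference(node1_iops, node2_iops):.2f}%")
```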

Note: All testing was performed concurrently and background tasks performed by Nutanix “Curator” function such as ILM (Tiering) and Disk Balancing were all running during these tests.

What does this mean?

Running business critical workloads on the same Nutanix cluster does not cause any significant noisy neighbour type issues, which can and do occur in traditional centralised shared storage solutions.

VMware have attempted to mitigate this issue with technology such as Storage I/O Control (SIOC) and Storage DRS (SDRS), but these issues are natively eliminated thanks to the scale out shared nothing architecture of the Nutanix Xtreme Computing Platform (XCP).

Customers can be confident that performance achieved on one node is repeatable as Nutanix clusters are scaled even with Business Critical applications with large working sets which easily exceed the SSD tier.

It also means performance doesn’t “fall off the cache cliff” and become inconsistent, which has long been a fear with systems dependent on cache for performance.

Nutanix has chosen not to rely on caching to achieve high read/write performance. Instead we tune our defaults for consistent performance across large working sets and to ensure data integrity, which means we commit writes to persistent media before acknowledging them and perform checksums on all read and write I/O. This is key for business critical applications such as MS SQL, MS Exchange and Oracle.
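
For anyone unfamiliar with what "commit before acknowledge" and checksumming look like in practice, here is a generic, single-file illustration in Python. It is in no way how NDFS is implemented; it simply shows the two principles, using CRC32 as an assumed checksum and hypothetical function names.

```python
import os
import zlib


def durable_write(path, data: bytes) -> int:
    """Write data, force it to persistent media, and only then return.

    The returned checksum doubles as the acknowledgement: the caller
    never sees it until fsync has completed.
    """
    checksum = zlib.crc32(data)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # commit to disk before acknowledging
    return checksum


def verified_read(path, expected_checksum: int) -> bytes:
    """Read data back and refuse to return it if the checksum differs."""
    with open(path, "rb") as f:
        data = f.read()
    if zlib.crc32(data) != expected_checksum:
        raise ValueError("checksum mismatch: data is corrupt")
    return data
```

The trade-off is that every write pays the cost of reaching persistent media, which is exactly why peak numbers from cache-only testing look better than what a system designed this way will report.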