Nutanix Resiliency – Part 5 – Read I/O during CVM maintenance or failures?

In the earlier parts of this series we’ve talked about how ADSF can recovery quickly from a node failure by re-protecting data in a distributed manner across the cluster. We also covered how resiliency can be increased from Resiliency Factor 2 (RF2) to RF3 and even changed to a more space efficient Erasure Coding (EC-X) configuration all without interruption.

Now let’s cover the critically important topic of how VMs are impacted during Nutanix Controller VM (CVM) maintenance such as AOS upgrades OR during failures such as the CVM crashing or even being accidentally or maliciously turned off.

Let’s quickly cover the basics of how Nutanix ADSF writes and protects data.

Looking an the following diagram we see a three node cluster with a single Virtual Machine. The VM has written some data represented by a,b,c & d & under normal circumstances all writes will have one replica written to the host running the VM (in this case Node 1) and the other replica (or replicas in the case of RF3) distributed throughput the cluster based on disk fitness values. The disk fitness values (or what I call “Intelligent replica placement”) ensure data is placed in the most optimal place the first time.

RF2Overview

If one or more nodes are added to the cluster, the Intelligent replica placement will send proportionally more replicas to those nodes until the cluster is in a balanced state. In the very unlikely even no new writes are occurring, ADSF has a background disk balancing process which will balance the cluster as a low priority.

Now that we know the basics of how Nutanix protects data using multiple replicas (called “Resiliency Factor”) let’s talk about what happens during a Nutanix ADSF storage layer upgrade.

Upgrades are initiated by a one-click process and performed in rolling style one controller VM (CVM) at a time regardless of the configured Resiliency Factor and if Erasure Coding (EC-X) is used or not. The rolling upgrade put simply takes one CVM offline at a time, performs the upgrade, performs and self check and then rejoins the cluster and then repeats the process on the next CVM.

One of the many advantages of Nutanix decoupling the storage from the hypervisor (i.e.: not embedding storage into the kernel) is that upgrades and even storage layer failures do not impact the running Virtual machines.

VMs do not need to be restarted (i.e.: Like a HA event) nor do they need to migrate (e.g.: vMotion) to another node. VMs continue without interruption to storage traffic even when the local controller is offline for any reason.

If the local CVM is down for maintenance or due to failure, the I/O is dynamically re-directed throughout the cluster.

Let’s look at a Read I/O when the CVM local to a VM is offline (for any reason).

The local CVM being offline means the physical drives (NVMe, SSD, HDD etc) are not available meaning the local data (replicas) is unavailable.

All read I/O will be redirected and continue to function as it will now be served by all CVMs in the cluster.

ReadIOServedRemotelyWhenCVMifOffline

This maintenance/failure scenario could be compared to a 3 Tier architecture in that the node running the VM is not currently providing storage and is connecting to the storage over a network. But as Nutanix is a distributed architecture all nodes within the cluster service the reads meaning in the worst case scenario of a three node cluster, during a failure or maintenance Nutanix has an equivalent architecture to an optimally performing dual controller storage array.

Let’s cover that one more time, in the WORST case scenario where the smallest cluster has suffered a failure (or maintenance) causing the read IO to be served remotely, Nutanix in a degraded state is at worst equivalent to a compute node accessing a dual controller storage array in it’s OPTIMAL state.

If the Nutanix cluster was for example eight nodes and one node was performing maintenance or the CVM was down for any reason, seven nodes would be serving IO to the VMs on that node. This process is actually nothing new and something Nutanix has done for a long time. It’s described in more detail in Acropolis Hypervisor (AHV) I/O Failover & Load Balancing which was published in July 2015.

Once the local CVM is back online, Read I/O is once again serviced by the local CVM and the only remote reads which occur will be in the case where a copy of data does not exist on the local node. When remote read/s occur, the 1MB extent which holds the data being read will be localised to allow subsequent reads to be local. It’s critical to understand the process of localising the extent (replica) adds no additional overhead on the network compared to a remote read so localising benefits performance without additional overheads.

Summary:

  1. ADSF writes data on the node where the VM resides to ensure subsequent reads are local.
  2. Read I/O is serviced by the local CVM and when the Local CVM is unavailable for any reason the read I/O is serviced by all CVMs in the cluster in a distributed manner
  3. Virtual machines do not need to be failed over or evacuated from a node when the local CVM is offline due to maintenance or failure
  4. In the worst case scenario of a 3 node cluster and a CVM down, a virtual machine running on Nutanix has it’s traffic serviced by at least two storage controllers which is the best case scenario for a Server + Dual Controller Storage Array (3 Tier) architecture.
  5. In clusters larger than three, Virtual machines on Nutanix enjoy more storage controllers serving their read I/O than an optimal scenario for a Server + Dual Controller Storage Array (3 Tier) architecture.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10: Nutanix Resiliency – Part 10 – Disk Scrubbing / Checksums

Nutanix Resiliency – Part 3 – Node failure rebuild performance with RF3

In part 1 we discussed the ability of Nutanix AOS to rebuild Resiliency Factor 2 (RF2) from a node failure in a fast and efficient manner thanks to the Acropolis Distributed Storage Fabric (ADSF) while part 2 showed how a storage container can be converted from RF2 to RF3 to further improve resiliency and how fast the process completed.

As with Part 2, we’re using a 12 node cluster and the breakdown of disk usage per node is as follows:

NodeCapacityUsage12NodeCLusterRF3

The node I’ll be simulating a failure on has 5TB of disk usage which is very similar to the capacity usage in the node failure testing in Part 1. It should be noted that as the cluster is now only 12 nodes, there are less controllers to read/write from/too as compared to Part 1.

Next I accessed the IPM Interface of the node and performed a “Power Off – Immediate” operation to simulate the node failure.

The following shows the storage pool throughput for the node rebuild which completed re-protecting the 5TB of data in approx 30mins.

RebuildPerformanceAndCapacityUsageRF3_12Nodes

Looking at the results at first glance, re-protecting in around 30 mins for 5TB of data, especially on 5yo hardware is pretty impressive, especially compared to SANs and other HCI products, but I felt it should have been faster so I did some investigation.

I found that the cluster was in an imbalanced state at the time I simulated the node failure and therefore not all nodes could contribute to the rebuild (from a read perspective) like they would under normal circumstances because they had little/no data on them.

The reason the cluster was in the un-balanced state is due to my having been performing frequent/repeated node failure simulations and I did not wait for disk balancing to complete after adding nodes back to the cluster before simulating the node failure.

Usually a vendor would not post sub-optimal performance results, but I strongly feel that transparency is key and while unlikely, it is possible to get in situations where a cluster is unbalanced and if a node failure occurred during this unlikely scenario it’s important to understand how that may impact resiliency.

So I ensured the cluster was in a balanced state and then re-ran the test and the result is shown below:

RF3NodeFailureTest4.5TBnode

We can now see over 6GBps throughput compared to around 5Gbps, an improvement of over 1GBps, and a duration of approx 12mins. We also can see there was no drop in throughput as we previously saw in the unbalanced environment. This is due to all nodes being able to participate for the duration of the rebuild as they all had an even amount of data.

Summary:

  • Nutanix RF3 is vastly more resilient than RAID6 (or N+2) style architectures
  • ADSF performs continual disk scrubbing to detect and resolve underlying issues before they can cause data integrity issues
  • Rebuilds from drive or node failures are an efficient distributed operation using all drives and nodes in a cluster
  • A recovery from a >4.5TB node failure (in this case, the equivalent of 6 concurrent SSD failures) around 12mins
  • Unbalanced clusters still perform rebuilds in a distributed manner and can recover from failures in a short period of time
  • Clusters running in a normal balanced configuration can recover from failures even faster thanks to the distributed storage fabric built in disk balancing, intelligent replica placement and even distribution of data.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10: Nutanix Resiliency – Part 10 – Disk Scrubbing / Checksums

Nutanix Resiliency – Part 2 – Converting from RF2 to RF3

In part 1 we discussed the ability of Nutanix AOS to rebuild from a node failure in a fast and efficient manner thanks to the Acropolis Distributed Storage Fabric (ADSF). In part 2 I wanted to show how a storage container can be converted from RF2 to RF3 and the speed at which the operation can be completed.

For this testing, only 12 nodes exist within the cluster.

ClusterSize

Let’s start with the storage pool capacity usage.

RF2Usage

Here we see just over 50TB of storage usage across the cluster.

In converting to RF3, or put simply adding a third replica of all data, we need to ensure we have enough available capacity otherwise RF3 wont be in compliance.

Next we increase the Redundancy Factor for the cluster (and metadata) to RF3. This enables the cluster to support RF3 containers, and to survive at least two node failures from a metadata perspective.

ReducdancyFactorCLuster

Next we increase the desired Storage Container to RF3.

Once the container is set to RF3, curator will detect the cluster is not in compliance with the configured redundancy factor and kick of a background task to create the additional replicas.

In this case, we started with approx 50TB of data in the storage pool, so this task will need to create 50% more replicas so we should end up with around 75TB of data.

Let’s see how long it took the cluster to create 25TB of data to comply with the new Redundancy Factor.

RF2toRF3on12NodeCluster

Here we see throughput of over 7GBps and the process taking less than 3 hours, so approx 8.3TB per hour. It is important to note that the cluster remained fully resilient at an RF2 level throughout the whole process, and had new writes been happening during this phase, they would all be protected with RF3.

Below is a chart showing the storage pool capacity usage increasing in a very linear fashion throughout the operation.

StoragePoolCapacityGrowth

Had the cluster been larger, it is important to note this task would have performed faster, as ADSF is a truely distributed storage fabric and the more node, the more controllers than participate in all write activity. For a great example of the advantage of adding additional nodes check out Scale out performance testing with Nutanix Storage Only Nodes.

Once the operation was completed we can see the storage pool capacity usage is at the expected 75TB level.

StoragePoolCapacityWithRF3

For those who are interested how hard Nutanix ADSF can drive the physical drives, I pulled some stats during the compliance phase.

StargateExtentStoreStats

What we can see highlighted is that the physical drives are being driven at or close to their maximum and the read and write I/O is being performed across all drives, not just to a single cache drive and then offloaded to capacity drives like less intelligent HCI platforms.

Summary:

  • Nutanix ADSF can change between Redundancy levels (RF2 and RF3) on the fly
  • A compliance operation creating >25TB of data can complete in less than 3 hours (even on 5 year old equipment)
  • The compliance operation performed in a linear manner throughout the task.
  • A single Nutanix Controller VM (CVM) is efficient enough to drive 6 x physical SSDs at close to their maximum ability
  • ADSF reads and writes to all drives and does not use a less efficient cache and capacity style architecture.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10: Nutanix Resiliency – Part 10 – Disk Scrubbing / Checksums