Nutanix Resiliency – Part 3 – Node failure rebuild performance with RF3

In Part 1 we discussed the ability of Nutanix AOS to rebuild Resiliency Factor 2 (RF2) data after a node failure in a fast and efficient manner thanks to the Acropolis Distributed Storage Fabric (ADSF). Part 2 showed how a storage container can be converted from RF2 to RF3 to further improve resiliency, and how fast that process completed.

As with Part 2, we’re using a 12 node cluster and the breakdown of disk usage per node is as follows:

NodeCapacityUsage12NodeCLusterRF3

The node I’ll be simulating a failure on has 5TB of disk usage, which is very similar to the capacity usage in the node failure testing in Part 1. It should be noted that as the cluster is now only 12 nodes, there are fewer controllers to read from and write to compared with Part 1.

Next I accessed the IPMI interface of the node and performed a “Power Off – Immediate” operation to simulate the node failure.

The following shows the storage pool throughput for the node rebuild, which completed re-protecting the 5TB of data in approx 30 mins.

RebuildPerformanceAndCapacityUsageRF3_12Nodes

At first glance, re-protecting 5TB of data in around 30 mins, especially on 5 year old hardware, is pretty impressive compared to SANs and other HCI products, but I felt it should have been faster so I did some investigation.

I found that the cluster was in an imbalanced state at the time I simulated the node failure and therefore not all nodes could contribute to the rebuild (from a read perspective) like they would under normal circumstances because they had little/no data on them.

The cluster was in the unbalanced state because I had been performing frequent, repeated node failure simulations and did not wait for disk balancing to complete after adding nodes back to the cluster before simulating the next node failure.
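To illustrate why an unbalanced cluster rebuilds more slowly, here is a minimal sketch of a toy model. It assumes every surviving node reads its share of the source data in parallel at the same rate, so the rebuild finishes only when the node holding the most data finishes. The per-node read rate and the split of data across nodes are hypothetical values chosen for illustration, not measurements from this cluster, and the model ignores writes, network and foreground I/O.

```python
def rebuild_minutes(source_tb_per_node, node_read_gbps):
    """Estimate rebuild time when all surviving nodes read in parallel.

    Toy model: the rebuild is only as fast as the node with the most
    source data to read.
    """
    busiest_gb = max(source_tb_per_node) * 1000  # decimal TB -> GB
    return busiest_gb / node_read_gbps / 60


DATA_TB = 5.0     # data to re-protect, as in the test above
SURVIVORS = 11    # 12 node cluster minus the failed node
READ_GBPS = 0.5   # hypothetical per-node read rate (assumption)

# Balanced: every surviving node holds an equal share of the source data.
balanced = [DATA_TB / SURVIVORS] * SURVIVORS
# Hypothetical imbalance: 4 recently re-added nodes hold no source data yet.
unbalanced = [DATA_TB / 7] * 7 + [0.0] * 4

print(f"balanced:   {rebuild_minutes(balanced, READ_GBPS):.1f} min")
print(f"unbalanced: {rebuild_minutes(unbalanced, READ_GBPS):.1f} min")
```

In this model the same 5TB takes longer to re-protect simply because fewer nodes have anything to read, which matches the behaviour described above.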

Usually a vendor would not post sub-optimal performance results, but I strongly feel that transparency is key. While unlikely, it is possible for a cluster to end up unbalanced, and if a node failure occurs in that state it’s important to understand how that may impact resiliency.

So I ensured the cluster was in a balanced state and then re-ran the test and the result is shown below:

RF3NodeFailureTest4.5TBnode

We can now see over 6GBps throughput compared to around 5GBps, an improvement of over 1GBps, and a duration of approx 12 mins. We can also see there was no drop in throughput as we previously saw in the unbalanced environment. This is due to all nodes being able to participate for the duration of the rebuild as they all had an even amount of data.
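As a quick sanity check on these numbers, the average rebuild throughput can be worked out from the data size and duration. The sketch below uses decimal units (1TB = 1000GB) and the approximate figures from the two tests: 5TB in 30 mins for the unbalanced run, and roughly 4.5TB in 12 mins for the balanced run. The function name is my own.

```python
def average_throughput_gbps(data_tb: float, minutes: float) -> float:
    """Average rebuild throughput in GB/s, using decimal units (1 TB = 1000 GB)."""
    return (data_tb * 1000) / (minutes * 60)


unbalanced = average_throughput_gbps(5.0, 30)  # first test (unbalanced cluster)
balanced = average_throughput_gbps(4.5, 12)    # second test (balanced cluster)

print(f"unbalanced average: {unbalanced:.2f} GB/s")
print(f"balanced average:   {balanced:.2f} GB/s")
```

The balanced run averages 6.25GB/s, which lines up with the sustained 6GBps+ seen in the chart. The unbalanced run averages about 2.78GB/s, well below its ~5GBps peak, which reflects the drop in throughput part way through that rebuild.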

Summary:

  • Nutanix RF3 is vastly more resilient than RAID6 (or N+2) style architectures
  • ADSF performs continual disk scrubbing to detect and resolve underlying issues before they can cause data integrity issues
  • Rebuilds from drive or node failures are an efficient distributed operation using all drives and nodes in a cluster
  • A recovery from a >4.5TB node failure (in this case, the equivalent of 6 concurrent SSD failures) completed in around 12 mins
  • Unbalanced clusters still perform rebuilds in a distributed manner and can recover from failures in a short period of time
  • Clusters running in a normal balanced configuration can recover from failures even faster thanks to the distributed storage fabric’s built-in disk balancing, intelligent replica placement and even distribution of data.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10 – Disk Scrubbing / Checksums

Dare2Compare Part 4 : HPE provides superior resiliency than Nutanix?

As discussed in Part 1, we have proven HPE have made false claims about Nutanix snapshot capabilities as part of the #HPEDare2Compare twitter campaign.

In part 2, I explained how HPE/Simplivity’s 10:1 data reduction HyperGuarantee is nothing more than smoke and mirrors and that most vendors can provide the same if not greater efficiencies, even without hardware acceleration.

In part 3, I corrected HPE on their false claim that Nutanix cannot support dedupe without 8vCPUs and in part 4, I will respond to the claim (below) that Nutanix has less resiliency than HPE Simplivity 380.

To start with, the biggest cause of data loss, downtime, outages etc. in my experience is human error: poor design, improper use of a product, poor implementation/validation and a lack of operational procedures or the discipline to follow them. The number of times I’ve seen properly designed solutions have issues I can count on one hand.

Those rare situations have come down to multiple concurrent failures at different levels of the solution (e.g.: Infrastructure, Application, OS etc), not just things like one or more drive or server failures.

Nonetheless, HPE Simplivity are commonly targeting Resiliency Factor 2 (RF2) and claiming it is not resilient because they lack a basic understanding of the Acropolis Distributed Storage Fabric, how it distributes data and rebuilds from failures, and therefore how resilient it is.

RF2 is often mistakenly compared to RAID 5, where a single drive failure takes a long time to rebuild and subsequent failures during rebuilds are not uncommon which would lead to a data loss scenario (for RAID 5).

Let’s talk about some failure scenarios comparing HPE Simplivity to Nutanix.

Note: The below information is accurate to the best of my knowledge, based on my testing and experience with both products.

When is a write acknowledged to the Virtual machine

HPE Simplivity – They use what they refer to as an Omnistack Accelerator card (OAC) which uses “Super capacitors to provide power to the NVRAM upon a power loss”. When a write hits the OAC it is then acknowledged to the VM. It is assumed, or even likely, that the capacitors will provide sufficient power to commit the writes persistently to flash, but the fact is that writes are acknowledged BEFORE they are committed to persistent media. HPE will surely argue the OAC is persistent, but until the data is on something such as a SATA-SSD drive I do not consider it persistent, and I invite you to ask your trusted advisor/s their opinion because this is a grey area at best.

This can be confirmed on Page 29 of the SimpliVity Hyperconverged Infrastructure Technology Overview:

OACPowerLossLol

Nutanix – Writes are only acknowledged to the Virtual Machine when the write IO has been checksummed and confirmed written to persistent media (e.g.: SATA-SSD) on the number of nodes/drives based on the configured Resiliency Factor (RF).

Writes are never written to RAM or any other non persistent media and at any stage you can pull the power from a Nutanix node/block/cluster and 100% of the data will be in a consistent state. i.e.: It was written and acknowledged, or it was not written and therefore not acknowledged.
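To make the acknowledgement rule concrete, here is a minimal sketch of the idea: checksum the write, persist it on the configured number of replicas, and only then acknowledge. This is a toy illustration and not Nutanix code. The `Replica` class, the `write` function and the use of CRC32 are all assumptions made for the example, and an in-memory dict merely stands in for persistent media.

```python
import zlib


class Replica:
    """Hypothetical stand-in for one node's persistent media."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.store = {}

    def persist(self, key, data, checksum):
        """Store the data only if the node is healthy and the checksum matches."""
        if not self.healthy or zlib.crc32(data) != checksum:
            return False
        self.store[key] = (data, checksum)
        return True


def write(key, data, replicas, rf):
    """Return True (acknowledge) only once rf replicas have persisted the write."""
    checksum = zlib.crc32(data)  # CRC32 is an assumption, chosen for simplicity
    persisted = 0
    for replica in replicas:
        if persisted == rf:
            break
        if replica.persist(key, data, checksum):
            persisted += 1
    return persisted == rf


nodes = [Replica("node-a"), Replica("node-b"), Replica("node-c")]
print(write("vm1-block0", b"payload", nodes, rf=2))  # acknowledged

degraded = [Replica("node-a"), Replica("node-b", healthy=False)]
print(write("vm1-block1", b"payload", degraded, rf=2))  # not acknowledged
```

The key design point is that the return value (the acknowledgement) depends on the persisted copy count, never on the write merely having been received. A real system would also clean up partially written copies, which this sketch omits.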

Because Nutanix only acknowledges writes once data is written to persistent media on two or more hosts, the platform is compliant with FUA and Write Through. HPE SVT, in the best case, is dependent on power protection (UPS and/or OAC capacitors). This means Nutanix is more resilient (less risk) and has a higher level of data integrity than the HPE SVT product.

Check out “Ensuring Data Integrity with Nutanix – Part 2 – Forced Unit Access (FUA) & Write Through” for more information. It explains how Nutanix is compliant with critical data integrity protocols such as FUA and Write Through, and you can make up your own mind whether the HPE product is or not. Hint: A product is not compliant with FUA unless data is written to persistent media before acknowledgement.

Single Drive (NVMe/SSD/HDD) failure

HPE Simplivity – Protects data with RAID 6 (or RAID 5 on small nodes) + Replication (2 copies). A single drive failure causes a RAID rebuild which is a medium/high impact activity for the RAID group. RAID rebuilds are well known to be slow, this is one reason why HPE chooses (and wisely so) to use low capacity spindles to minimise the impact of RAID rebuilds. But this choice to use RAID and smaller drives has implications around cost/capacity/rack unit/power/cooling and so on.

Nutanix – Protects data with configurable Replication Factor (2 or 3 copies, or N+1 and N+2) along with rack unit (block) awareness. A single drive failure causes a distributed rebuild of the data contained on the failed drive across all nodes within the cluster. This distributed rebuild is evenly balanced throughout the cluster for low impact and faster time to recover. This allows Nutanix to support large capacity spindles, such as 8TB SATA.
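The difference between a RAID rebuild and a distributed rebuild can be approximated with simple arithmetic. The sketch below assumes the rebuild is limited purely by sustained drive throughput and that work is split evenly across the participating drives. The 150MB/s per-drive rate and the count of 40 participating drives are hypothetical values for illustration only; parity calculation, network and foreground I/O are ignored.

```python
def rebuild_hours(data_tb, per_drive_mbps, participating_drives):
    """Hours to re-protect data_tb when work is split evenly across drives."""
    total_mb = data_tb * 1_000_000  # decimal TB -> MB
    return total_mb / (per_drive_mbps * participating_drives) / 3600


# RAID style: the rebuild funnels into a single replacement drive.
raid_style = rebuild_hours(8, 150, participating_drives=1)
# Distributed: many drives across the cluster share the work (assumed 40).
distributed = rebuild_hours(8, 150, participating_drives=40)

print(f"RAID style:  {raid_style:.1f} hours")
print(f"distributed: {distributed:.2f} hours")
```

Under these assumptions the 8TB drive takes well over half a day to rebuild onto a single target, while spreading the same work across many drives cuts the time in proportion to the number of drives taking part.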

Two concurrent drive (NVMe/SSD/HDD) failures *Same Node

HPE Simplivity – RAID 6 + Replication (2 copies) supports two drive failures and, as with a single drive failure, this causes a RAID rebuild which is a medium/high impact activity for the RAID group.

Nutanix – Two drive failures cause a distributed rebuild of the data contained on the failed drives across all nodes within the cluster. This distributed rebuild is evenly balanced throughout the cluster for low impact and faster time to recover. This allows Nutanix to support large capacity spindles, such as 8TB SATA. No data is lost even when using Resiliency Factor 2 (which is N+1), despite what HPE claims. This is an example of the major advantage the Nutanix Acropolis Distributed Storage Fabric has over the RAID and mirroring type architecture of HPE SVT.

Three concurrent drive (NVMe/SSD/HDD) failures *Same Node

HPE Simplivity – RAID 6 + Replication (2 copies) supports the loss of only two drives per RAID group. At this stage the RAID group has failed and all data must be rebuilt.

Nutanix – Three drive failures again just cause a distributed rebuild of the data contained on the failed drives (in this case, 3) across all nodes within the cluster. This distributed rebuild is evenly balanced throughout the cluster for low impact and faster time to recover. This allows Nutanix to support large capacity spindles, such as 8TB SATA. No data is lost even when using Resiliency Factor 2 (which is N+1), again despite what HPE claims. This is an example of the major advantage the Nutanix Acropolis Distributed Storage Fabric has over the RAID and mirroring type architecture of HPE SVT.

Four or more concurrent drive (NVMe/SSD/HDD) failures *Same Node

HPE Simplivity – The RAID 6 + Replication (2 copies) supports the loss of only two drives per RAID group. Three or more failures result in a failed RAID group and a total rebuild of the data is required.

Nutanix – Nutanix can support N-1 drive failures per node, meaning in a 24 drive system, such as the NX-8150, 23 drives can be lost concurrently without the node going offline and without any data loss. The only caveat is the lone surviving drive for a hybrid platform must be an SSD. This is an example of the major advantage the Nutanix Acropolis Distributed Storage Fabric has over the RAID and mirroring type architecture of HPE SVT.

Next let’s cover off failure scenarios across multiple nodes.

Two concurrent drive (NVMe/SSD/HDD) failures in the same cluster.

HPE Simplivity – RAID 6 protects from 2 drive failures locally per RAID group whereas Replication (2 copies) supports the loss of one copy (N-1). Assuming the RAID groups are intact, data would not be lost.

Nutanix – Nutanix has configurable resiliency (Resiliency Factor) of either 2 copies (RF2) or three copies (RF3). Using RF3, under any two drive failure scenario there is no data loss and it causes a distributed rebuild of the data contained on the failed drives across all nodes within the cluster.

When using RF2 and block (rack unit) awareness, in the event two or more drives fail within a block (which is up to 4 nodes with a total of 24 SSDs/HDDs), there is no data loss. In fact, in this configuration Nutanix can support the loss of up to 24 drives concurrently e.g.: 4 entire nodes and 24 drives without data loss/unavailability.

When using RF3 and block awareness, Nutanix can support the loss of up to 48 drives concurrently e.g.: 8 entire nodes and 48 drives without data loss/unavailability.
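The reasoning behind block awareness can be checked with a small simulation: if every copy of a piece of data lives in a different block, then losing an entire block (every node and drive in it) can never remove more than one copy. The sketch below is a simplified model, not the actual ADSF placement algorithm. The round-robin placement, the block counts and the function names are assumptions for illustration.

```python
from itertools import combinations


def place_replicas(num_extents, num_blocks, rf):
    """Place rf copies of each extent on rf distinct blocks (round-robin)."""
    return [
        {(extent + i) % num_blocks for i in range(rf)}
        for extent in range(num_extents)
    ]


def survives(placement, failed_blocks):
    """True if every extent keeps at least one copy outside failed_blocks."""
    failed = set(failed_blocks)
    return all(copies - failed for copies in placement)


def tolerates_any(placement, num_blocks, failures):
    """True if no combination of `failures` failed blocks loses an extent."""
    return all(
        survives(placement, combo)
        for combo in combinations(range(num_blocks), failures)
    )


rf2 = place_replicas(num_extents=100, num_blocks=3, rf=2)  # hypothetical 3 blocks
rf3 = place_replicas(num_extents=100, num_blocks=5, rf=3)  # hypothetical 5 blocks

print("RF2 survives any 1 block failure:", tolerates_any(rf2, 3, 1))
print("RF3 survives any 2 block failures:", tolerates_any(rf3, 5, 2))
```

With a block holding up to 4 nodes and 24 drives, one failed block corresponds to the 24 drive case for RF2 and two failed blocks to the 48 drive case for RF3.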

Under no circumstances can HPE Simplivity support the loss of ANY 48 drives (e.g.: 2 HPE SVT nodes w/ 24 drives each) and maintain data availability.

This is another example of the major advantage the Nutanix Acropolis Distributed Storage Fabric has over the RAID and mirroring type architecture of HPE SVT. Nutanix distributes all data throughout the ADSF cluster, which is something HPE SVT cannot do, and this impacts both performance and resiliency.

Two concurrent node (NVMe/SSD/HDD) failures in the same cluster.

HPE Simplivity – If the two HPE SVT nodes mirroring the data both go offline, you have data unavailability at best, with data loss at worst. As HPE SVT is not a cluster, (note the careful use of the term “Federation”) it scales essentially in pairs and each pair cannot fail concurrently.

Nutanix – With RF3 even without the use of block awareness, any two nodes and all drives within those nodes can be lost, with no data unavailability.
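The two models described above can be compared by enumerating every possible two node failure. The sketch below assumes a hypothetical 8 node environment and models the mirrored design, as characterised above, as fixed pairs of nodes each holding both copies of their data. The RF3 layout is a simple round-robin placement chosen for illustration, not the real placement logic.

```python
from itertools import combinations


def unavailable_combos(placement, num_nodes, failures=2):
    """Count failure combinations that remove every copy of some data item."""
    count = 0
    for combo in combinations(range(num_nodes), failures):
        failed = set(combo)
        if any(copies <= failed for copies in placement):
            count += 1
    return count


NUM_NODES = 8  # hypothetical environment size

# Mirrored pairs: both copies of a data set live on one fixed pair of nodes.
mirrored_pairs = [{n, n + 1} for n in range(0, NUM_NODES, 2)]
# RF3: three copies on three distinct nodes (round-robin, for illustration).
rf3 = [{n, (n + 1) % NUM_NODES, (n + 2) % NUM_NODES} for n in range(NUM_NODES)]

print("mirrored pairs:", unavailable_combos(mirrored_pairs, NUM_NODES))
print("RF3:", unavailable_combos(rf3, NUM_NODES))
```

Of the 28 possible two node failures in this model, 4 make data unavailable in the mirrored pair layout, while none do with RF3, because three copies can never all sit on just two failed nodes.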

Three or more concurrent node (NVMe/SSD/HDD) failures in the same cluster.

HPE Simplivity – As previously discussed, HPE SVT cannot support the loss of any two nodes, so three or more makes matters worse.

Nutanix – With RF3 and block awareness, up to eight (yes 8!!) nodes can be lost along with all drives within those nodes, with no data unavailability. That’s up to 48 SSD/HDDs concurrently failing without data loss.

So we can clearly see Nutanix provides a highly resilient platform and there are numerous configurations which ensure two drive failures do not cause data loss despite what the HPE campaign suggests.

https://twitter.com/HPE_SimpliVity/status/872073088569139200

The above tweet would be like me configuring an HPE Proliant server with RAID 5 and complaining HPE lost my data when two drives fail. It’s just ridiculous.

The key point here is that when deploying any technology, you need to understand your requirements and configure the underlying platform to meet/exceed your resiliency requirements.

Installation/Configuration

HPE Simplivity – Dependent on vCenter.

Nutanix – Uses PRISM which is a fully distributed HTML 5 GUI with no external dependencies regardless of Hypervisor choice (ESXi, AHV, Hyper-V and XenServer). In the event any hypervisor management tool (e.g.: vCenter) is down, PRISM is fully functional.

Management (GUI)

HPE Simplivity – Uses a vCenter backed GUI. If vCenter is down, Simplivity cannot be fully managed. In the best case scenario, where vCenter HA is used, management will have a short interruption when a vCenter goes down.

Nutanix – Uses PRISM which is a fully distributed HTML 5 GUI with no external dependencies regardless of Hypervisor choice (ESXi, AHV, Hyper-V and XenServer). In the event any hypervisor management tool (e.g.: vCenter) is down, PRISM is fully functional.

In the event of a node/s failing, PRISM being a distributed management layer continues to operate.

Data Availability:

HPE Simplivity – RAID 6 (or RAID 60) + Replication (2 copies), Deduplication and Compression for all data. Not configurable.

Nutanix – Configurable resiliency and data reduction with:

  1. Resiliency Factor 2 (RF2)
  2. Resiliency Factor 3 (RF3)
  3. Resiliency Factor 2 with Block Awareness
  4. Resiliency Factor 3 with Block Awareness
  5. Erasure Coding / Deduplication / Compression in any combination across all resiliency types.

Key point:

Nutanix can scale out with compute+storage OR storage only nodes. In either case, resiliency of the cluster is increased as all nodes (or better said, Controllers) in our distributed storage fabric (ADSF) help with the distributed rebuild in the event of drive/s or node/s failures. This restores the cluster to a fully resilient state faster, so it can support subsequent failures.

HPE Simplivity – Due to HPE SVT’s platform not being a distributed file system, and working in a mirror style configuration, adding additional nodes up to the “per datacenter” limit of eight (8) does not increase resiliency. As such, the platform does not improve as it grows, whereas improving with scale is a strength of the Nutanix platform.

Summary:

Thanks to our Acropolis Distributed Storage Fabric (ADSF) and without the use of legacy RAID technology, Nutanix can support:

  1. Equal or more concurrent drive failures per node than HPE Simplivity
  2. Equal or more concurrent drive failures per cluster than HPE Simplivity
  3. Equal or more concurrent node failures than HPE Simplivity
  4. Failure of hypervisor management layer e.g.: vCenter with full GUI functionality

Nutanix also has the following capabilities over and above the HPE SVT offering:

  1. Configurable resiliency and data reduction on a per vDisk level
  2. Nutanix resiliency/recoverability improves as the cluster grows
  3. Nutanix does not require any UPS or power protection to be compliant with FUA & Write Through

HPE SVT is less resilient during the write path because:

  1. HPE SVT acknowledges writes before committing data to persistent media (by their own admission)

Return to the Dare2Compare Index:

Microsoft Exchange on Nutanix Best Practice Guide

I am pleased to announce that the Best Practice guide for Microsoft Exchange on Nutanix is released and can be found here.

For me, deploying MS Exchange on Nutanix with vSphere combines best of breed application level resiliency (in the form of Exchange Database Availability Groups), infrastructure and hypervisor technologies to provide an infrastructure with not only high performance, but also industry leading scalability, no silos and very high efficiency & resiliency.

All of this leads to overall lower CAPEX/OPEX for customers.

In summary, by virtualizing MS Exchange on Nutanix, customers realize several key benefits including:

  • Ability to use a standard platform for all workloads in the datacenter, thus allowing the removal of legacy silos resulting in lower overall cost, and increased operational efficiencies.
    • An example of this is no disruption to MS Exchange users when performing Nutanix / Hypervisor or HW maintenance
  • A highly resilient, scalable and flexible MS Exchange deployment.
  • Reducing the number of Exchange Mailbox servers required to maintain 4 copies of Exchange data thanks to the combination of NDFS + DAG. (2 copies at NDFS layer / 2 copies at DAG layer)
  • Eliminate the need for large / costly refresh cycles of HW as individual nodes can be added and removed non disruptively.
  • Simplified architecture, no need for complex sizing architecture or risk of over sizing day 1, start small and scale VMs, Compute or storage if/when required.
  • No dependency on specific HW, Exchange VMs can be migrated to/from any Nutanix node and even to non Nutanix nodes.
  • Full support from Nutanix including at the Exchange, Hypervisor and Storage layers with support from Microsoft via Premier Support contracts or via TSANet.
  • Lower CAPEX/OPEX as Exchange can be combined with new or existing Nutanix/Virtualization deployment.
  • Reduced datacenter costs including Power, Cooling, Space (RU)

I hope you enjoy the Best Practice guide and look forward to hearing about your MS Exchange on Nutanix questions & experiences.