Nutanix Acropolis Hypervisor (AHV) certified for 30k Microsoft Exchange Mailboxes

Last year Nutanix announced we had successfully completed Microsoft Exchange Solution Reviewed Program (ESRP) certification for Hyper-V. I am pleased to announce that, continuing our focus on giving customers the choice to deploy business critical applications on any hypervisor, we have now also achieved ESRP certification for our Acropolis Hypervisor (AHV).

I believe Acropolis Hypervisor (AHV) and the Nutanix platform are a great choice for business critical applications such as MS Exchange, as they provide all the benefits of virtualization without the complexity of legacy hypervisors and management platforms.

For more information on the advantages of AHV specifically for MS Exchange see:  MS Exchange on Nutanix Acropolis Hypervisor (AHV).

The Nutanix listings on the Microsoft Exchange Solution Reviewed Program for both Hyper-V and AHV can be found at the following URL.

Exchange Solution Reviewed Program (ESRP) – Storage

The Nutanix Best Practice guide for MS Exchange on AHV is also due for release shortly, so stay tuned!

Related Articles:

1. Think HCI is not an ideal way to run your mission-critical x86 workloads? Think again!

2. Jetstress Testing with Intelligent Tiered Storage Platforms

3. Microsoft Exchange 2013/2016 Jetstress Performance Testing on Nutanix Acropolis Hypervisor (AHV)

4. Peak performance vs Real World – Exchange on Nutanix Acropolis Hypervisor (AHV)

Benchmark(et)ing Nonsense IOPS Comparisons, if you insist – Nutanix AOS 4.6 outperforms VSAN 6.2

As many of you know, I’ve taken a stand with many other storage professionals to try to educate the industry that peak performance is vastly different to real world performance. I covered this in a post titled: Peak Performance vs Real World Performance.

I have also given a specific example of Peak Performance vs Real World Performance with a business critical application (MS Exchange), where I demonstrate that the first and most significant constraining factor for Exchange performance is compute (CPU/RAM), so achieving more IOPS is unnecessary to achieve the business outcome (which is supporting a given number of Exchange mailboxes/messages per day).

However, vendors (all of them) who offer products which provide storage, whether as a component such as in HCI or as a dedicated storage offering, continue to promote peak performance numbers. They do this because the industry as a whole has promoted, and continues to promote, these numbers as if they were relevant, with vendors trying to one-up each other with nonsense comparisons.

VMware and the EMC federation have made a lot of noise around In-Kernel delivering better performance than Software Defined Storage running within a VM, which is referred to by some as a VSA (Virtual Storage Appliance). At the same time, the same companies/people are recommending business critical applications (vBCA) be virtualized. This is a clear contradiction, as I explain in an article I wrote titled In-Kernel versus Virtual Storage Appliance, which in short concludes by saying:

…a high performance (1M+ IOPS) solution can be delivered both In-Kernel or via a VSA, it’s as simple as that. We are long past the days where a VM was a significant bottleneck (circa 2004 w/ ESX 2.x).

I stand by this statement and the in-kernel vs VSA debate is another example of nonsense comparisons which have little/no relevance in the real world. I will now (reluctantly) cover off (quickly) some marketing numbers before getting to the point of this post.

VMware VSAN 6.2

Firstly, congratulations to VMware on this release. I believe you now have a minimally viable product thanks to the introduction of software-based checksums, which are essential for any storage platform.

VMW Claim One: For the VSAN 6.2 release, “delivering over 6M IOPS with an all-flash architecture”

The basic math for a 64 node cluster works out to ~93,750 IOPS per node, but as I have seen this benchmark from Intel showing 6.7 million IOPS for a 64 node cluster, let’s give VMware the benefit of the doubt and assume an even 7M IOPS, which equates to 109,375 IOPS per node.
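For transparency, the per-node figures above are nothing more than simple division. A quick sanity check (the node count and cluster totals are simply the numbers quoted above):

```python
# Quick sanity check of the per-node math above (simple division, nothing vendor specific).
CLUSTER_NODES = 64

for label, cluster_iops in [("VMware claim (6M)", 6_000_000),
                            ("Intel benchmark (6.7M)", 6_700_000),
                            ("Benefit of the doubt (7M)", 7_000_000)]:
    per_node = cluster_iops / CLUSTER_NODES
    print(f"{label}: {per_node:,.0f} IOPS per node")

# Output:
# VMware claim (6M): 93,750 IOPS per node
# Intel benchmark (6.7M): 104,688 IOPS per node
# Benefit of the doubt (7M): 109,375 IOPS per node
```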

Reference: VMware Virtual SAN Datasheet

VMW Claim Two: Highest Performance >100K IOPS per node

The graphic below (pulled directly from VMware’s website) shows their performance claims of >100K IOPS per node and >6 Million IOPS per cluster.

Reference: Introducing you to the 4th Generation Virtual SAN

Now what about Nutanix Distributed Storage Fabric (NDSF) & Acropolis Operating System (AOS) 4.6?

We’re now at the point where the hardware is becoming the bottleneck, as we are saturating the performance of physical Intel S3700 enterprise-grade solid state drives (SSDs) on many of our hybrid nodes. As such, we have moved on to performance testing of our NX-9460-G4 model, which has 4 nodes running Haswell CPUs and 6 x Intel S3700 SSDs per node, all in 2RU.

With AOS 4.6 running ESXi 6.0 on an NX-9460-G4 (4 x NX-9040-G4 nodes), Nutanix is seeing in excess of 150K IOPS per node, which is 600K IOPS per 2RU (Nutanix Block).

The below graph shows performance per node and how the solution scales in terms of performance up to a 4 node / 1 block solution which fits within 2RU.

[Graph: NOS46Perf – AOS 4.6 IOPS per node, scaling up to a 4 node / 1 block (2RU) solution]

So Nutanix AOS 4.6 provides approx. 36% higher performance than VSAN 6.2.

(>150K IOPS per NX-9040-G4 node compared to <=110K IOPS for an all-flash VSAN 6.2 node)
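And the ~36% figure is simply the ratio of the two per-node numbers quoted above; again, quick sanity-check arithmetic, nothing scientific:

```python
# The ~36% figure is just the ratio of the two per-node numbers quoted above.
nutanix_iops_per_node = 150_000   # >150K IOPS per NX-9040-G4 node (AOS 4.6)
vsan_iops_per_node = 110_000      # <=110K IOPS per all-flash VSAN 6.2 node

advantage = (nutanix_iops_per_node / vsan_iops_per_node - 1) * 100
print(f"Per-node advantage: ~{advantage:.0f}%")   # Per-node advantage: ~36%
```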

It should be noted the above Nutanix performance numbers have already been improved upon in upcoming releases currently going through performance engineering and QA, so this is far from the best you will see.


Enough with the nonsense marketing numbers! Let’s get to the point of the post:

These 4k 100% random read IOPS (and similar) tests are totally unrealistic.

Even assuming the 4k IOPS tests were realistic, to quote my previous article:

Peak performance is rarely a significant factor for a storage solution.

More importantly, SO WHAT if Vendor A (in this case Nutanix) has higher peak performance than Vendor B (in this case VSAN)!

What matters is customer business outcomes, not benchmark(eting)!


Wait a minute, the vendor with the higher performance is telling you peak performance doesn’t matter, instead of bragging about it and trying to make it sound important?

Yes, you are reading that correctly: no one should care who has the highest unrealistic benchmark!

I wrote Things to consider when choosing infrastructure a while back to highlight that choosing the “Best of Breed” for every workload may not be a good overall strategy, as it will require management of multiple silos, which leads to inefficiency and increased costs.

The key point is that if you can meet all the customer requirements (e.g.: performance) with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, you’re doing yourself (or your customer) a disservice by not considering a standard platform for your workloads. So if Vendor X has 10% faster performance (even for your specific workload) than Vendor Y, but Vendor Y still meets your requirements, performance shouldn’t be a significant consideration when choosing a product.

Both VSAN and Nutanix are software defined storage and I expect both will continue to rapidly improve performance through tuning done completely in software. If we were talking about a product which is dependent on offloading to hardware, then sure, performance comparisons would be relevant for longer, but VSAN and Nutanix are both 100% software and can/do improve performance in software with every release.

In 3 months, VSAN might be slightly faster. Then 3 months later Nutanix will overtake them again. In reality, peak performance rarely if ever impacts real world customer deployments and with scale out solutions, it’s even less relevant as you can scale.

If a solution can’t scale, or does so only in 2-node mirror-type configurations, then considering peak performance is much more critical. I’d suggest if you’re looking at this (legacy) style of product you have bigger issues.

Not only does performance in the software defined storage world change rapidly, so does the performance of the underlying commodity hardware, such as CPUs and SSDs. This is why it’s important to consider products (like VSAN and Nutanix) that are not dependent on proprietary hardware, as hardware eventually becomes a constraint. This is why the world is moving towards software defined for storage, networking etc.

If more performance is required, the ability to add new nodes, form a heterogeneous cluster and distribute data evenly across the cluster (like NDSF does) is vastly more important than the peak IOPS difference between two products.

While you might think that this blog post is a direct attack on HCI vendors, the same principle holds true for any hardware or storage vendor out there. It is only a matter of time before customers stop getting trapped in benchmark(et)ing wars. They will instead identify their real requirements and readily embrace the overall value of dramatically simple on-premises infrastructure.

In my opinion, Nutanix is miles ahead of the competition in terms of value, flexibility, operational benefits, product maturity and market-leading customer service, all of which matter far more than peak performance (where Nutanix is fastest anyway).

Summary:

  1. Focus on what matters and determine whether or not a solution delivers the required business outcomes. Hint: This is rarely just a matter of MOAR IOPS!
  2. Don’t waste your time in benchmark(et)ing wars or proof of concept bake offs.
  3. Nutanix AOS 4.6 outperforms VSAN 6.2
  4. A VSA can outperform an in-kernel SDS product, so let’s put that in-kernel vs VSA nonsense to rest.
  5. Peak performance benchmarks still don’t matter even when the vendor I work for has the highest performance. (a.k.a. my opinion doesn’t change based on my employer’s current product capabilities)
  6. Storage vendors ALL should stop with the peak IOPS nonsense marketing.
  7. Software-defined storage products like Nutanix and VSAN continue to rapidly improve performance, so comparisons are outdated soon after publication.
  8. Products dependent upon proprietary hardware are not the future.
  9. Put a high focus on the quality of vendors’ support.

Related Articles:

  1. Peak Performance vs Real World Performance
  2. Peak performance vs Real World – Exchange on Nutanix Acropolis Hypervisor (AHV)
  3. The Key to performance is Consistency
  4. MS Exchange Performance – Nutanix vs VSAN 6.0
  5. Scaling to 1 Million IOPS and beyond linearly!
  6. Things to consider when choosing infrastructure.

Problem: ROBO/Dark Site Management, Solution: XCP + AHV

Problem:

Remote Office / Branch Office, commonly referred to as “ROBO”, and dark sites (i.e. offices without local support staff and/or network connectivity to a central datacenter) are notoriously difficult to design, deploy and manage.

Why have infrastructure at ROBO?

The reason customers have infrastructure at ROBO and/or Dark Sites is because these sites require services which cannot be provided centrally due to any number of constraints such as WAN bandwidth/latency/availability or, more frequently, security constraints.

Challenges:

Infrastructure at ROBO and/or dark sites needs to be functional, highly available and performant without complexity. The problem is that, as the functional requirements of ROBO/dark sites are typically not dissimilar to those of the datacenter/s, the complexity of these sites can equal or even exceed that of the primary datacenter, despite the reduced budgets available for ROBO.

This means in many cases the same management stack needs to be designed on a smaller scale, deployed and somehow managed at these remote/secure sites with minimal to no I.T presence onsite.

Alternatively, management may be run centrally, but this can have its own challenges, especially when WAN links are high latency/low bandwidth or unreliable/offline.

Typical ROBO deployment requirements.

Typical requirements are in many cases not dissimilar to those of the SMB or enterprise and include things like High Availability (HA) for VMs, so a minimum of 2 nodes and some shared storage. Customers also want to ensure ROBO sites can be centrally managed without deployment of complex tooling at each site.

ROBO and Dark Sites are also typically deployed because in the event of WAN connectivity loss, it is critical for the site to continue to function. As a result, it is also critical for the infrastructure to gracefully handle failures.

So let’s summarise typical ROBO requirements:

  • VM High Availability
  • Shared Storage
  • Be fully functional when WAN/MAN is down
  • Low/no touch from I.T
  • Backup/Recovery
  • Disaster Recovery

Solution:

Nutanix Xtreme Computing Platform (XCP) including PRISM and Acropolis Hypervisor (AHV).

Now let’s dive into why XCP + PRISM + AHV is a great solution for ROBO.

A) Native Cross Hypervisor & Cloud Backup/Recovery & DR

Backup/Recovery and DR are not easy things to achieve or manage for ROBO deployments. Luckily these capabilities are built into Nutanix XCP. This includes the ability to take point-in-time, application-consistent snapshots and replicate them to local/remote XCP clusters and cloud providers (AWS/Azure). These snapshots can be considered backups once replicated to a 2nd location (ideally offsite), as well as being kept locally on primary storage for fast recovery.

ROBO VMs replicated to remote/central XCP deployments can be restored onto either ESXi or Hyper-V via the App Mobility Fabric (AMF) so running AHV at the ROBO has no impact on the ability to recover centrally if required.

This is just another way Nutanix is ensuring customer choice and proves the hypervisor is well and truly a commodity.

In addition XCP supports integration with the market leader in data protection, Commvault.

B) Built in Highly Available, Distributed Management and Monitoring

When running AHV, all XCP, PRISM and AHV management, monitoring and even the HTML 5 GUI are built in. The management stack requires no design, sizing, installation, scaling or 3rd party backend database products such as SQL/Oracle.

For those of you familiar with the VMware stack, XCP + AHV provides the capabilities of vCenter, vCenter Heartbeat, vRealize Operations Manager, Web Client, vSphere Data Protection and vSphere Replication. And it does this in a highly available and distributed manner.

This means, in the event of a node failure, the management layer does not go down. If the Acropolis Master node goes down, the Master roles are simply taken over by an Acropolis Slave within the cluster.

As a result, the ROBO deployment management layer is self healing, which dramatically reduces the complexity and all but removes the requirement for onsite attendance by I.T.

C) Scalability and Flexibility

XCP with AHV ensures that even when ROBO deployments need to scale to meet compute or storage requirements, the platform does not need to be re-architected, re-engineered or optimised.

Adding a node is as simple as plugging it in and turning it on, and the cluster can be expanded non-disruptively via PRISM (locally or remotely) in just a few clicks.

When the equipment becomes end of life, XCP also allows nodes to be non-disruptively removed from clusters and new nodes added, which means after the initial deployment, ongoing hardware replacements can be done without major redesign/reconfiguration of the environment.

In fact, deployment of new nodes can be done by people onsite with minimal I.T knowledge and experience.

D) Built-in One-Click Maintenance and Upgrades for the entire stack

XCP supports one-click, non-disruptive upgrades of:

  • Acropolis Base Software (NDSF layer)
  • Hypervisor (agnostic)
  • Firmware
  • BIOS

This means there is no need for onsite I.T. staff to perform these upgrades, and XCP eliminates potential human error by fully automating the process. All upgrades are performed one node at a time and only started if the cluster is in a resilient state, to ensure maximum uptime. Once one node is upgraded, it is validated as successful (similar to a Power On Self Test, or POST) before the next node proceeds. In the event an upgrade fails, the cluster will remain online, as I have described in this post.
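To illustrate the flow described above, here is a minimal, purely conceptual sketch of a one-node-at-a-time rolling upgrade with a pre-check and post-validation. All names (Node, cluster_is_resilient, upgrade, validate) are hypothetical stand-ins, not the actual Nutanix upgrade mechanism or API; it simply shows the order of operations.

```python
# Purely conceptual sketch of the one-node-at-a-time upgrade flow described above.
# All names here (Node, cluster_is_resilient, etc.) are hypothetical stand-ins,
# not the actual Nutanix upgrade mechanism or API.
from typing import List


class Node:
    def __init__(self, name: str) -> None:
        self.name = name

    def upgrade(self) -> None:
        print(f"Upgrading {self.name} (one node at a time)")

    def validate(self) -> bool:
        # Post-upgrade health check, similar in spirit to a POST.
        print(f"Validating {self.name}")
        return True


def cluster_is_resilient(nodes: List[Node]) -> bool:
    # Stand-in for "the cluster can tolerate taking one node offline".
    return len(nodes) >= 3


def rolling_upgrade(nodes: List[Node]) -> None:
    for node in nodes:
        # Only proceed if the cluster is currently in a resilient state.
        if not cluster_is_resilient(nodes):
            print("Cluster not resilient: upgrade paused, cluster remains online.")
            return
        node.upgrade()
        # Validate the upgraded node before moving on to the next one.
        if not node.validate():
            print(f"Validation failed on {node.name}: stopping, remaining nodes untouched.")
            return


rolling_upgrade([Node(f"node-{i}") for i in range(1, 5)])
```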

These upgrades can also be done centrally via PRISM Central.

E) Full Self Healing Capabilities

As I have already touched on, XCP + AHV is a fully self healing platform. From the storage (NDSF) layer to the virtualization layer (AHV) through to management (PRISM), the platform can fully self heal without any intervention from I.T. admins.

With Nutanix XCP you do not need expensive hardware support contracts or to worry about potential subsequent failures, because the system self heals and does not depend on hardware replacement as I have described in hardware support contracts & why 24×7 4 hour onsite should no longer be required.

Anyone who has ever managed a multi-site environment knows how much effort hardware replacement is, as well as the fact that replacements must be done in a timely manner which can delay other critical work. This is why Nutanix XCP is designed to be distributed and self healing as we want to reduce the workload for sysadmins.

F) Ease of Deployment

All of the above features and functionality can be quickly and easily deployed, going from out of the box to fully operational and ready to run VMs in just minutes.

The Management/Monitoring solutions do not require detailed design (sizing/configuration) as they are all built in and they scale as nodes are added.

G) Reduced Total Cost of Ownership (TCO)

When it comes down to it, ROBO deployments can be critical to the success of a company, and trying to do things “cheaper” rarely ends up actually being cheaper. Nutanix XCP may not be the cheapest (CAPEX), but it will deliver the lowest TCO, which is, after all, what matters.

If you’re a sysadmin and you don’t think you can get any more efficient after reading the above than what you’re doing today, it’s because you already run XCP + AHV 🙂

In all seriousness, sysadmins should be innovating and providing value back to the business. If they are instead spending any significant time “keeping the lights on” for ROBO deployments, then their valuable time is not being well utilised.

Summary:

Nutanix XCP + AHV provides all the capabilities required for typical ROBO deployments while reducing the initial implementation and ongoing operational cost/complexity.

With Acropolis Operating System 4.6 and the cross hypervisor backup/recovery/DR capabilities thanks to the App Mobility Fabric (AMF), there is no need to be concerned about the underlying hypervisor as it has become a commodity.

AHV performance and availability are on par with, if not better than, other hypervisors on the market, as is clear from the points discussed above.

Related Articles:

  1. Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.