Microsoft Exchange 2013/2016 Jetstress Performance Testing on Nutanix Acropolis Hypervisor (AHV)

Virtualization of business critical applications has been commonplace for a number of years. However, it is less well known that these applications are also regularly deployed on Nutanix Hyper-converged Infrastructure (HCI), as I discuss in the following post:

Think HCI is not an ideal way to run your mission-critical x86 workloads? Think again!

I am regularly involved in discussions with customers about how well MS Exchange and other business critical applications perform on Nutanix, especially during:

  • Storage software upgrades (Acropolis Base Software)
  • Hypervisor upgrades
  • VM migrations (e.g.: vMotion)
  • Failure scenarios

Customers also ask how Data Locality works with workloads like Exchange which have large amounts of data, what overheads there are (if any), how much data is served locally vs remotely, and so on.

As a result, I have created a series of videos demonstrating the following:

  • Setting a baseline for Jetstress performance on Node 1
  • Migrating VM to a 2nd node and repeating the Jetstress performance test
  • Migrating VM to a 3rd node and repeating the Jetstress performance test
  • Migrating VM to a 4th node and repeating the Jetstress performance test
  • Migrating the VM back to the 1st node and repeating the Jetstress performance test
  • Repeating the test on the 2nd, 3rd and 4th nodes (second Jetstress run for comparison)
  • Performing a Jetstress performance test on a VM with the local Nutanix Controller VM (CVM) offline (to simulate a CVM failure, Storage Maintenance or Upgrade scenarios)

During the above videos I will show advanced Nutanix Distributed Storage Fabric (NDSF) performance statistics, such as how Write I/O is being served and what percentage of data is being served locally versus remotely.

Enjoy the videos:

Part 1 – Setting a baseline for Jetstress performance on Nutanix AHV

Part 2 – Migrating Jetstress to 2nd node and repeating Jetstress test

Part 3 – Migrating Jetstress to 3rd node and repeating Jetstress test

Part 4 – Migrating Jetstress to 4th node and repeating Jetstress test

Part 5 through 8 – Repeat Jetstress Tests on all four nodes. (Coming soon)

Part 9 – Take the local Nutanix Controller VM (CVM) offline and repeat test (Coming soon)

Part 10 – Scale out Performance Validation (Coming soon)

Related Articles:

Think HCI is not an ideal way to run your mission-critical x86 workloads? Think again! – Part 2

Now continuing from Part 1, let’s look at another one of VCE COO Todd Pavone’s statements from the COO: VCE converged infrastructure not affected by Dell-EMC article:

We believe that there was a major gap in the core data center for hyper-converged, where customers wanted hyper-converged architecture — they don’t want to invest in tier-one storage or tier-one servers. They want the intelligence in the software, but they also want massive scale. This is for globals, large service providers in a massive scale, like thousands of nodes. We have a large financial service company in New York that is using us for a platform-free application build-up. And they want to pilot it with 10,000 users, but it’s going to go to 10 million users. And so, can we give them an infrastructure for 10,000, but can scale simply and easily to 10 million — or 20 million?

You can’t do that on an appliance, right? But they want hyper-converged. When you get to 10 million users, you want an infrastructure that scales and is nonlinear, leading to a lower cost model. So, we said, “There’s a gap in that market,” and we created the rack.

Let’s again address these points:

  • Todd: “They don’t want to invest in tier-one storage or tier-one servers. They want the intelligence in the software, but they also want massive scale.”

If customers don’t want to invest in what I would call “traditional” tier one storage and servers, then I’d have to agree with them: they need a very different solution, such as Nutanix, if they want to get to massive scale, especially if they want easy management & deployment.

Nutanix has customers ranging from 3 to thousands of nodes, and many of our large customers run Acropolis Hypervisor. So any question about scalability for Nutanix is just laughable.

  • Todd: “And they want to pilot it with 10,000 users, but it’s going to go to 10 million users. And so, can we give them an infrastructure for 10,000, but can scale simply and easily to 10 million — or 20 million? You can’t do that on an appliance, right?”

Well, you can with Nutanix! In fact, that sounds like a common use case for Nutanix: we frequently design and pilot repeatable models and then scale as required.

  • Todd: “But they want hyper-converged. When you get to 10 million users, you want an infrastructure that scales and is nonlinear, leading to a lower cost model. So, we said, “There’s a gap in that market,” and we created the rack.”

It’s no surprise to me at all that customers want Hyperconverged and the ability to scale both linearly and non-linearly. Nutanix can do this today and has been able to do it for a long time. Back in 2013, for example, you could mix NX3000 series nodes, which are Compute heavy / Storage light, with NX6000 nodes, which are Compute light / Storage heavy. This is an example of non-linear scaling which achieves a reduced cost (e.g.: Cost/GB) over time.

Then in 2014 an even wider range of nodes was released (NX1000, NX3000, NX6000 & NX8000) which enhanced Nutanix’s ability to scale both up and out, linearly and non-linearly.

In 2015 Nutanix launched the NX-6035C “Storage Only” node which allows customers to scale Storage separately from Compute, enabling non-linear scaling of compute vs storage for customers with high capacity requirements. Importantly, no hypervisor licensing is required to scale storage, as storage only nodes run Acropolis Hypervisor (AHV), which is fully interoperable with ESXi and Hyper-V environments.

Remember the Rule of thumb: Don’t scale capacity without scaling storage controllers!

Nutanix Storage Only nodes run a lightweight Controller VM (CVM) to ensure Management, Monitoring and Data services (e.g.: Disk Balancing, Compression, Dedupe, Erasure Coding etc) do not degrade even when scaling compute and storage in a vastly non-linear manner. Storage only nodes also help improve performance by participating in cluster replication (RF2/RF3) and disk balancing activities.

  • Todd: “So, we said, “There’s a gap in that market,” and we created the rack.”

There may have been a gap back in early 2013, but since then Nutanix has continued to innovate and lead the market with solutions to scale both linearly and non-linearly, so I’d say the gap has long been filled. Nutanix also scales management with a single HTML 5 GUI called PRISM, with central management of multiple clusters/sites/geographical locations via PRISM Central.

Summary:

I’m sure it’s pretty obvious by now VCE COO Todd Pavone and I have different opinions on what HCI is capable of. During my time at Nutanix I have seen countless successful small, medium and large scale mission-critical application deployments and the percentage of Nutanix business from these workloads continues to increase thanks to our investment in a dedicated vBCA team which I am fortunate to be a part of.

Next time you’re considering new infrastructure for a mission-critical application, reach out and I’ll happily work with you to see if Nutanix is a good fit for your use case.

Let me finish by saying, I can guarantee you that if in the unlikely event the workload/s are not suitable for Nutanix, I will be the first one to tell you, and help you find an alternate solution.

Back to Part 1.

Peak performance vs Real World – Exchange on Nutanix Acropolis Hypervisor (AHV)

I wrote a post in April 2015 titled “Peak Performance vs Real World Performance” which discusses how benchmarks are not realistic and the performance shown in benchmarks can rarely be reproduced with real workloads. It has been one of my most popular posts, and I have had overwhelmingly positive feedback, with only a select few still pushing unrealistic peak performance benchmarks as being of value to customers.

I thought I would whip up a post showing an example of benchmarks vs real world performance requirements using MS Exchange Jetstress on Nutanix.

The below is a screen shot from the Nutanix PRISM HTML based GUI showing a Virtual Machine’s Read/Write IOPS, bandwidth and latency during an MS Exchange Jetstress benchmark.

JetstressAHV20160105

The screen shot shows ~4000 Read IOPS and ~4000 Write IOPS at a latency of 1.59ms.

But what does the above really tell us and what does it mean to a customer?

I’ve been quoted as saying “Benchmarks are of little value without context specific to customer requirements!” and I stand by this statement.

Let’s now look at an example of a real customer’s requirement:

The below is from the Exchange server role requirements calculator and it is a screen shot from the Role requirements tab which shows an estimate of the IOPS required for the Databases and Logs for a single Exchange instance.

ExchangeIOexample

It shows the required IOPS being 536 for the databases and 115 for the logs.

Note: The sizing calculator was for an environment supporting 20000 mailboxes across 3 mailbox servers. As such, the above IO requirements are for ~6666 users.

So now that we have done the MS Exchange solution sizing (shown above is just the storage performance requirements), we understand the requirement to be around 651 mixed Read/Write IOPS per mailbox VM. We can then take a benchmark such as Jetstress and validate that the solution has sufficient storage performance.
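The arithmetic behind that requirement, and the headroom the benchmark result implies, can be sketched roughly as follows. The figures are the ones quoted in this post (536 database IOPS, 115 log IOPS, 20000 mailboxes across 3 servers, and roughly 4000 read plus 4000 write IOPS from the Jetstress screen shot); the variable and function names are illustrative only and are not part of Jetstress or the Exchange calculator.

```python
# Rough sizing arithmetic using the figures quoted in this post.
# All names below are illustrative assumptions, not part of any tool.

TOTAL_MAILBOXES = 20_000        # environment size from the calculator
MAILBOX_SERVERS = 3             # mailbox servers in the design
DB_IOPS_PER_SERVER = 536        # database IOPS per Exchange instance
LOG_IOPS_PER_SERVER = 115       # log IOPS per Exchange instance
BENCHMARK_READ_IOPS = 4_000     # approximate, from the Jetstress screen shot
BENCHMARK_WRITE_IOPS = 4_000    # approximate, from the Jetstress screen shot


def required_iops(db_iops: int, log_iops: int) -> int:
    """Total mixed read/write IOPS one mailbox VM needs."""
    return db_iops + log_iops


def headroom_ratio(benchmark_iops: int, needed_iops: int) -> float:
    """How many times over the benchmark result covers the requirement."""
    return benchmark_iops / needed_iops


users_per_server = TOTAL_MAILBOXES // MAILBOX_SERVERS
required = required_iops(DB_IOPS_PER_SERVER, LOG_IOPS_PER_SERVER)
benchmark = BENCHMARK_READ_IOPS + BENCHMARK_WRITE_IOPS
ratio = headroom_ratio(benchmark, required)

print(f"Users per mailbox server: ~{users_per_server}")
print(f"Required IOPS per mailbox VM: {required}")
print(f"Benchmark IOPS: ~{benchmark} ({ratio:.1f}x the requirement)")
```

The point of the comparison is the ratio rather than the raw benchmark number: the requirement comes from the sizing exercise, and the benchmark only tells us how much headroom sits above it.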

To require the ~8000 IOPS the Jetstress test showed, we would need to scale up each Exchange instance to support a much larger number of users and have each user send/receive 500 emails per day.

8kJetstressIOPS

But in scaling up each Exchange instance to reach the peak IOPS that even this 3 year old generation Nutanix node can deliver, we would vastly exceed the compute sizing recommendations for Exchange 2013 (being 24 vCPUs and 96GB RAM), as shown by the calculator below.

ScaleUpExchange

As we can see, for an Exchange instance to require those peak IOPS, we would have to size the Mailbox server VMs with more than 10x the recommended vCPUs (24) and 15x the RAM (96GB). This shows that peak IOPS which can be achieved are not relevant in the real world.

In fact, Exchange generally does not require more than 1000 IOPS. Typically it requires much less, as my earlier example shows. So peak performance numbers are of little/no value as they can’t be (and more importantly don’t need to be) reproduced in the real world.

With a tool like Jetstress you can configure precise Mailbox profiles and test only what you require. If the solution can produce more IOPS than you need (such as in this example), that’s fine for headroom, but in this day and age where Nutanix allows you to quickly and easily scale (Compute/Storage performance & capacity), I recommend designing for what you need in the foreseeable future (by this I mean 6-12 months) and scaling if/when required.

What a benchmark does help you understand is how much headroom a solution has over and above your requirements, which can help you choose a solution to support mixed workloads, BUT the benchmark would need to be re-run concurrently with suitable benchmarks for all other applications you intend on mixing to see how the solution behaves with mixed workloads.

As such, single application peak performance benchmarks are almost never valuable (to customers) unless you’re planning to run application specific silos. I strongly recommend that anyone considering implementing an application specific silo read the following article: Enterprise Architecture & Avoiding tunnel vision.

And… if you’re planning to run application specific silos and/or scaling up workloads to the point they need crazy IOPS, then you’re increasing the size of your failure domains, CAPEX and OPEX which is only doing yourself (or your customer) a disservice. But that’s a topic for another day.

I hope this example shows how real world requirements and performance are vastly different from what a benchmark shows, and why peak performance benchmarks should be taken with a grain of salt.

I’ve always said the focus should be on gathering requirements and delivering on business outcomes, not focusing on performance which is typically only a very small part of a solution that delivers a successful business outcome.

Summary:

When sizing an MS Exchange solution on Nutanix, IOPS is not a constraining factor even for large scale deployments. The most common constraining factor is the Microsoft recommended compute maximums of 24 vCPUs and 96GB RAM, which is the same constraint regardless of whether you run on Nutanix or any other virtual / physical platform.
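As a rough illustration of treating compute, not IOPS, as the constraint, a proposed Mailbox server VM size can be checked against those maximums before worrying about storage performance. This is only a sketch: the limits are the 24 vCPU / 96GB RAM figures quoted in this post, and the function name and example VM sizes are hypothetical.

```python
from typing import List

# Compute maximums as quoted in this post for Exchange 2013.
MAX_VCPUS = 24
MAX_RAM_GB = 96


def sizing_issues(vcpus: int, ram_gb: int) -> List[str]:
    """Return a list of reasons a proposed VM size exceeds the maximums.

    An empty list means the proposed size is within the recommendations.
    """
    issues = []
    if vcpus > MAX_VCPUS:
        issues.append(f"{vcpus} vCPUs exceeds the recommended maximum of {MAX_VCPUS}")
    if ram_gb > MAX_RAM_GB:
        issues.append(f"{ram_gb}GB RAM exceeds the recommended maximum of {MAX_RAM_GB}GB")
    return issues


# Hypothetical examples: one VM at the limits, one scaled up well past them.
at_limit = sizing_issues(vcpus=24, ram_gb=96)
scaled_up = sizing_issues(vcpus=240, ram_gb=1440)

print("At limit:", at_limit or "OK")
print("Scaled up:", scaled_up)
```

If a design only reaches benchmark-level IOPS by failing a check like this, the sensible answer is to scale out with more Mailbox server VMs rather than to scale each one up.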
