Fight the FUD – Cisco “My VSA is better than your VSA”

It seems like the FUD is surging out of Cisco thick and fast, which is great news since Nutanix is getting all the mind share and recognition as the clear market leader.

The latest FUD from Cisco is that their Virtual Storage Appliance (VSA, or what Nutanix calls a Controller VM, or CVM) is better than Nutanix's because it serves I/O from across the cluster, whereas Nutanix only serves I/O locally.

Quite frankly, I don't care how Cisco or any other vendor does what they do. I will simply explain what Nutanix does and why, and then you can make up your own mind.

Q1. Does Nutanix only serve I/O locally?

A1. No

Nutanix performs writes (e.g.: RF2/RF3) across two or three nodes before providing an acknowledgement to the guest OS. One copy of the data is always written locally, except in the case where the local SSD tier is full, in which case all copies will be written remotely.

[Image: Nutanix write I/O tiering diagram]

The above image is courtesy of the Nutanix Bible by Steve Poitras.

It shows that Write I/O is prioritized to be local to the Virtual Machine so that future Read I/O can be served locally, removing the network and other nodes/controllers as a potential bottleneck/dependency and ensuring optimal performance.

This also means a single VM can be serviced by “N” controllers concurrently, which improves performance.

Nutanix does this because we want to avoid as many dependencies as possible. Allowing the bulk of Read I/O to be serviced locally helps avoid traditional storage issues such as noisy neighbours. By writing locally we also avoid at least one network hop and remote controller/node involvement, as one of the replicas is always written locally.
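To make the write path above concrete, below is a minimal sketch (Python pseudocode of my own, not Nutanix source) of the placement behaviour described: one replica stays local unless the local SSD tier is full, and the write is only acknowledged once all RF copies have landed. The node names and the send_to() helper are illustrative assumptions, not ADSF internals.

```python
import random

def place_replicas(local_node, cluster_nodes, rf=2, local_ssd_full=False):
    """Pick the nodes that receive the RF copies of a write."""
    remote_nodes = [n for n in cluster_nodes if n != local_node]
    if local_ssd_full:
        # Local SSD tier is full: every copy is written to remote nodes.
        return random.sample(remote_nodes, rf)
    # Normal case: one copy local, the remaining rf - 1 copies on remote nodes.
    return [local_node] + random.sample(remote_nodes, rf - 1)

def send_to(node, block):
    # Hypothetical transport helper standing in for the replication traffic.
    print(f"writing {len(block)} bytes to the SSD tier on node {node}")

def write(block, local_node, cluster_nodes, rf=2, local_ssd_full=False):
    targets = place_replicas(local_node, cluster_nodes, rf, local_ssd_full)
    for node in targets:
        send_to(node, block)
    return "ack"   # acknowledgement returned to the guest OS only after all copies land

# Example: an RF2 write issued by a VM running on node "A" in a 4 node cluster.
write(b"guest-os-write", "A", ["A", "B", "C", "D"], rf=2)
```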

Q2. What if a VM’s active data exceeds the local SSD tier?

A2. I/O will be served by controllers throughout the Cluster

I have previously covered this topic in a post titled “What if my VMs storage exceeds the capacity of a Nutanix node?”. As a quick summary, the below diagram shows a VM on Node B having its data served across a 4 node cluster, all from SSD.

The above diagram also shows that node types can be Compute+Storage or Storage Only nodes, while still providing SSD tier capacity and a Nutanix CVM to deliver I/O and data services such as Compression/Dedupe/Erasure Coding, as well as monitoring/management capabilities.
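As a rough illustration of the above, the following sketch (assumed names and layout, not ADSF internals) serves each extent of a VM's working set from whichever node holds it in SSD, preferring the local node when it has a copy. This is how a single VM can draw on the SSD tier of the entire cluster, including Storage Only nodes.

```python
def pick_source(extent, local_node, ssd_locations):
    """ssd_locations maps an extent id to the nodes holding it in SSD."""
    nodes = ssd_locations[extent]
    return local_node if local_node in nodes else nodes[0]

# Hypothetical placement for a VM running on node "B" in a 4 node cluster.
ssd_locations = {
    "extent-1": ["B", "A"],   # a copy exists in node B's local SSD
    "extent-2": ["C", "D"],   # active data has spilled into other nodes' SSD tiers
    "extent-3": ["D", "A"],
}

for extent in ssd_locations:
    print(extent, "read from the SSD tier on node", pick_source(extent, "B", ssd_locations))
```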

Q3. What if data is not in the SSD tier?

A3. If data has been migrated to the SATA tier, it is accessed either locally or remotely based on average latency.

If data is moved from SSD to SATA, the first option is to service the I/O locally, but if the local SATA latency is above a threshold, the I/O will be serviced by a remote replica. This ensures that in the unlikely event of local contention, I/O is not unnecessarily delayed.

For reads from SATA, the bottleneck is the SATA drive itself, which means the latency of the network (typically <0.5ms) is insignificant when several milliseconds can be saved by using a replica on drives which are not as busy.

This is explained in more detail in “NOS 4.5 Delivers Increased Read Performance from SATA“.
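A minimal sketch of that read-path decision is shown below. The threshold value and the helper names are illustrative assumptions, not the actual ADSF heuristics.

```python
LATENCY_THRESHOLD_MS = 5.0   # assumed threshold, not the real ADSF value
NETWORK_HOP_MS = 0.5         # typical network latency quoted above

def choose_sata_source(local_latency_ms, remote_replica_latencies_ms):
    """Decide where a read from the SATA tier should be serviced."""
    if local_latency_ms <= LATENCY_THRESHOLD_MS:
        return ("local", local_latency_ms)
    # Local SATA is busy: use the least busy remote replica and accept the
    # comparatively insignificant network hop.
    node, latency = min(remote_replica_latencies_ms.items(), key=lambda kv: kv[1])
    return ("remote", node, latency + NETWORK_HOP_MS)

print(choose_sata_source(2.0, {"B": 4.0, "C": 9.0}))    # serviced locally
print(choose_sata_source(12.0, {"B": 4.0, "C": 9.0}))   # serviced by node B's replica
```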

Q4. Does Cisco HX outperform Nutanix?

A4. Watch out for unrealistic 4K benchmarks, especially on lower-end hardware and older AOS releases.

I am very vocal that peak performance benchmarks are a waste of time, as I explain in the following article “Peak Performance vs Real World Performance“.

VMware and EMC constantly attack Nutanix on performance, which is funny because Nutanix AOS 4.6 comfortably outperforms VSAN, as I show in this article: Benchmark(et)ing Nonsense IOPS Comparisons, if you insist – Nutanix AOS 4.6 outperforms VSAN 6.2.

Cisco will be no different; they will focus on unrealistic Benchmark(et)ing, which I will respond to in the not too distant future when the upcoming nonsense is released.

Coming soon: Cisco HX vs Nutanix AOS 4.6

Summary:

One of the reasons Nutanix is the market leader is our attention to detail. The value of the platform exceeds the sum of its parts because we consider and test all sorts of scenarios to ensure the platform performs in a consistent manner.

Nutanix can do things like remote SATA reads, track performance and serve I/O from the optimal location because of its truly distributed underlying storage fabric (ADSF). These sorts of capabilities are limited or simply not possible without this kind of underlying fabric.

Related Posts:

  1. Peak Performance vs Real World Performance
  2. Benchmark(et)ing Nonsense IOPS Comparisons, if you insist – Nutanix AOS 4.6 outperforms VSAN 6.2
  3. NOS 4.5 Delivers Increased Read Performance from SATA
  4. What if my VMs storage exceeds the capacity of a Nutanix node?
  5. Nutanix Bible

Erasure Coding Overheads – Part 1

Erasure Coding has become a hot topic in the Hyperconverged Infrastructure (HCI) world since Nutanix announced its implementation (EC-X) in June 2015 at its inaugural user conference, and VMware has recently followed up with support for EC in its 6.2 release for All-Flash deployments.

As this is a new concept to many in the industry, there have been a lot of questions about how it works, what the benefits are and, of course, what the trade-offs are.

In short, regardless of vendor, Erasure Coding allows data to be stored with tuneable levels of resiliency, such as single parity (similar to RAID 5) and double parity (similar to RAID 6). This provides more usable capacity than replication, which is more like RAID 1 with ~50% usable capacity of RAW.

Not dissimilar to RAID 5/6, Erasure Coding implementations have increased write penalties compared to replication (RF2 for Nutanix or FTT=1 for VSAN), which behaves more like RAID 1.

For example, the write penalties for RAID are as follows:

  • RAID 1 = 2
  • RAID 5 = 4
  • RAID 6 = 6

Similar write penalties apply to Erasure Coding, depending on each vendor's specific implementation and stripe size (either dynamic or fixed).
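To put those penalties in context, here is a generic back-of-the-envelope sketch of how a write penalty inflates back-end I/O. It is illustrative arithmetic only; real Erasure Coding overheads vary with stripe size and with whether striping is performed in-line or post-process.

```python
# Standard RAID-style write penalties from the list above.
WRITE_PENALTY = {
    "RAID 1 / RF2-style replication": 2,
    "RAID 5 / single parity EC": 4,
    "RAID 6 / double parity EC": 6,
}

def backend_iops(front_end_iops, write_ratio, penalty):
    """Back-end IOPS = reads + (writes x write penalty)."""
    reads = front_end_iops * (1 - write_ratio)
    writes = front_end_iops * write_ratio
    return reads + writes * penalty

# Example: 10,000 front-end IOPS at a 70/30 read/write mix.
for scheme, penalty in WRITE_PENALTY.items():
    print(f"{scheme}: {backend_iops(10_000, 0.30, penalty):,.0f} back-end IOPS")
```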

I have written a number of posts about the Nutanix-specific implementation; for those who are interested, see the following deep dive post:

Nutanix – Erasure Coding (EC-X) Deep Dive

VMware has also released a post titled The Use Of Erasure Coding In VMware Virtual SAN 6.2, covering their implementation of Erasure Coding.

The article is well written and I would like to highlight two quotes from the post which are applicable to any implementation of Erasure coding, including Nutanix EC-X and VSAN.

Quote #1

Erasure Coding does not come for free. It has a substantial overhead in operations per second (IOPS) and networking.

Quote #2

In conclusion, customers must evaluate their options based on their requirements and the use cases at hand. RAID-5/6 may be applicable for some workloads on All-Flash Virtual SAN clusters, especially when capacity efficiency is the top priority. Replication may be the better option, especially when performance is the top priority (IOPS and latency). As always, there is no such thing as one size fits all.

Pros of Erasure Coding:

  • Increased usable capacity of RAW storage compared to replication
  • Potential to increase the amount of data stored in SSD tier
  • Lower cost/GB
  • The Nutanix EC-X implementation places parity on the capacity tier to increase the effective SSD tier size

Cons of Erasure Coding:

  • Higher write overheads
  • Higher read impact in the event of a drive/node failure
  • Performance will suffer significantly for I/O patterns with a high percentage of overwrites
  • Increased computational overheads

Recommended Workloads to use Erasure Coding:

  • Write Once Read Many (WORM) workloads are the ideal candidate for Erasure Coding
  • Backups
  • Archives
  • File Servers
  • Log Servers
  • Email (depending on usage)

As many of the strong use cases for Erasure coding are workloads not requiring high IO, using Erasure Coding across both performance and capacity tiers can provide significant advantages.

Workloads not ideal for Erasure Coding:

  • Anything Write / Overwrite Intensive
  • VDI

This is because VDI is typically very write intensive, which would increase the overheads on the software-defined storage. VDI is also typically not capacity intensive thanks to intelligent cloning, so the advantages of EC would be minimal.

Summary:

Regardless of vendor, all Erasure Coding implementations have higher overheads than traditional replication, such as Nutanix RF2/RF3 and VSAN's FTT=1/2.

The overheads will vary depending on:

  • The configured parity level
  • The stripe size (which may vary between vendors)
  • The I/O profile (the more write intensive, the higher the overheads)
  • Whether striping is performed in-line on all data or post-process on write-cold data
  • Whether the stripe is degraded by a drive/node failure

The usable capacity also varies depending on:

  • The number of nodes in a cluster which can limit the stripe size (see the next point)
  • The stripe size (dependent on the number of nodes in the cluster)
    • E.g.: a 3+1 stripe will give up to 75% usable capacity and a 4+1 will give up to 80%.

It is important to understand that as the stripe size increases, the resulting gains in usable capacity diminish. As the stripe size increases, so do the overheads on the storage controllers and the network. The impact during a failure is also increased, as is the risk of a drive or node failure affecting the stripe.
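The diminishing returns are easy to see with the raw arithmetic. The sketch below is illustrative only, as the stripe sizes a cluster can actually use depend on the node count and each vendor's rules.

```python
def usable_fraction(data_blocks, parity_blocks):
    """Fraction of RAW capacity that is usable for a given stripe."""
    return data_blocks / (data_blocks + parity_blocks)

print(f"RF2 replication : {1 / 2:.1%}")
print(f"RF3 replication : {1 / 3:.1%}")
for data in (3, 4, 8, 16):
    print(f"{data}+1 erasure coding : {usable_fraction(data, 1):.1%}")

# 3+1 -> 75.0%, 4+1 -> 80.0%, 8+1 -> 88.9%, 16+1 -> 94.1%
# Going from 3+1 to 4+1 gains 5 points of usable capacity; going from 8+1 all
# the way to 16+1 gains only ~5 more, while the failure impact and rebuild
# cost keep growing with the stripe.
```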

In Part 2, I am planning on publishing testing examples to show the performance delta between typical replication and erasure coding for a write intensive workload.

Related Articles:

  1. Large scale clusters and increased resiliency with RF3 + EC-X
  2. What I/O will Nutanix Erasure coding (EC-X) take effect on?
  3. Sizing assumptions for solutions with Erasure Coding (EC-X)

Nutanix Acropolis Hypervisor (AHV) certified for 30k Microsoft Exchange Mailboxes

Last year Nutanix announced we had successfully completed Microsoft Exchange Solution Reviewed Program (ESRP) certification for Hyper-V. Now I am pleased to announce that we have continued our focus on giving customers the choice to deploy business critical applications on any hypervisor and have achieved ESRP certification for our Acropolis Hypervisor (AHV).

I believe the Acropolis Hypervisor (AHV) and the Nutanix platform are a great choice for business critical applications such as MS Exchange, as they give all the benefits of virtualization without the complexity of legacy hypervisors and management platforms.

For more information on the advantages of AHV specifically for MS Exchange see:  MS Exchange on Nutanix Acropolis Hypervisor (AHV).

The Nutanix listings on the Microsoft Exchange Solution Reviewed Program can be found at the following URL for both Hyper-V and AHV.

Exchange Solution Reviewed Program (ESRP) – Storage

The Nutanix Best Practice guide for MS Exchange on AHV is also due for release shortly, so stay tuned!

Related Articles:

  1. Think HCI is not an ideal way to run your mission-critical x86 workloads? Think again!
  2. Jetstress Testing with Intelligent Tiered Storage Platforms
  3. Microsoft Exchange 2013/2016 Jetstress Performance Testing on Nutanix Acropolis Hypervisor (AHV)
  4. Peak performance vs Real World – Exchange on Nutanix Acropolis Hypervisor (AHV)