Fight the FUD: Nutanix scale limitations

Posted on January 20, 2016 by Josh Odgers

I was reading COO: VCE converged infrastructure not affected by Dell-EMC on TechTarget this morning and came across the following quote from VCE COO Todd Pavone which I found a little amusing.

One of the risks that we see in the marketplace for these appliance players is they’re trying to take that appliance that’s been architected for what I think are more single, simple, edge use cases, and they’re trying to put those into the core. We said, “Rather than trying to do that, we’re going to build an architecture for scale.” Because if you study Nutanix and <Redacted>, any of these companies that we know really well, they have scale limitations. They get to certain nodes sizes, and they break. And then, you have to cut another cluster, you have to cut another cluster.

That’s not ideal for a core data center, because now, you’re managing all of them individually — you can’t tie them into your other core systems. And so, now, you have proliferating silos, which for us is … we think that’s a big no-no. Your operational costs aren’t going to improve.

What doesn’t surprise me is how much focus Nutanix gets from other vendors, especially EMC/VCE. Its a great validation of the success of the Nutanix platform and a great indication of what will be dominant datacenter architecture (Hyperconvered/HCI) and what platform will lead the market (Nutanix XCP) in the future.

As for this post, I will only speak about Nutanix Xtreme Computing Platform (XCP) and not about the other vendor he mentioned as I don’t see the value in talking about other vendors.

The below is my summary of the points Todd has made and my thoughts:

Todd: Nutanix has scale limitations

Josh: Nutanix has no Maximum cluster size (nodes per cluster). In fact, as the Nutanix Distributed Storage Fabric scales, the Write I/O continues to be distributed further meaning higher Write performance.

In this article (Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 3 – Scalability) I cover all aspects of scalability including Management, Performance, Capacity, Resiliency and how scaling effects Operational aspects.

While the above post is focusing on Acropolis Hypervisor (AHV), the scalability is also true when using other supported Hypervisors such as ESXi and Hyper-V within the limitations of those hypervisors.

I wonder if Todd would say vSphere has “Scale limitations” being they support clusters of 64? Probably not, he wouldn’t want to FUD VMware.

Update: Pretty timely claim by Todd when Nutanix has just delivered a >100 node, 2PB solution used for mixed workloads such as eDiscovery for Legal, High Performance SQL, MS Exchange and more.

Todd: They get the certain node sizes and they break?

Josh: I believe Todd may have been referring to “Cluster sizes” as opposed to “Node sizes” but as he is unfamiliar with Nutanix technology he is using incorrect terminology.

The first point covers “cluster” sizing, now I’ll cover nodes sizing. Nutanix along with Dell and Lenovo has numerous different node configurations which range from one to four CPU sockets and up to 768G RAM with various SSD/HDD combinations including All-Flash.

There is not a node size maximum for the Acropolis Base Software (formally known as NOS), its simply a matter of practicality. Nutanix is a distributed platform, not a legacy monolithic centralised platform. As such, scaling out is by design to improve things like resiliency and performance.

Nutanix also recommends against scaling up as this increases the impact in the event of a single node failure. e.g.: A 3 node cluster has an impact of 33% with one node failure, but an 8 node cluster has only a 12.5% impact with one failure.

Todd: They get to certain nodes sizes, and they break. And then, you have to cut another cluster, you have to cut another cluster.

Josh: Apart from repeating himself and using the term “node” incorrectly (again), Todd is implying Nutanix forces you to create new clusters at a given scale (which he fails to mention). As I mentioned earlier, Nutanix has no Maximum cluster size (nodes per cluster).

But as any good architect knows, there are considerations such as failure domains, security and constraints where having multiple clusters may be required or simply advantageous. One of the many great things about Nutanix XCP is multiple clusters (even with different hypervisors) can be managed centrally with PRISM central.

That brings us nicely to Todd’s next point:

Todd: That’s not ideal for a core data center, because now, you’re managing all of them individually

Josh: This statement is the last part of the quoted section, and again Todd is talking management of “nodes” as opposed to clusters. So first point, Nutanix XCP requires 3 nodes to form a cluster and that cluster managed via PRISM Element. Where multiple clusters exist, PRISM central is then used as a single pane of glass to manage all clusters.

The below is a video showing PRISM Element for two clusters then joining them to a PRISM central instance for central management. Note: This is a fairly old video (posted September 22, 2014) as Nutanix has been doing this for a long time, as such, PRISM Element and Central have been enhanced since this was created.

Here is an example of scaling Nutanix VDI for 20K to 200K+ Power User Desktops. It is a good example showing a real world design with Management clusters and VDI clusters which takes into consideration failure domains. This also follows well proven and accepted best practices for VMware Horizon View deployments, where the scale limitations are at the vSphere/Horizon layer, not the Nutanix layer.

Summary:

This is yet another example of one vendor talking nonsense about a vendor they compete with. If its reliable information your after, speak to the vendor who makes the product/s your interested in, get them to tell you about the product then ask to speak with reference customers to validate the information you have been provided.

Competitive vendors will only focus on what they perceive to be the issues with a given competitors platform. A good vendor will focus on their product and not discuss competitors even when asked for comparisons by customers.

To quote a person I have learnt a lot from while at Nutanix, “While our competitors focus on us, We are focusing on our customers” – Dheeraj Pandey Nutanix Founder and CEO.

Fight the FUD!

Follow up posts:

For more information about Nutanix XCP scalability see the following posts:

1. Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 3 – Scalability

2. Scaling Hyper-converged solutions – Compute only.

3. Scale Storage separately to Compute on Nutanix!

Nutanix AHV I/O path efficiency

Posted on December 18, 2015 by Josh Odgers

The I/O path in AHV is unlike other hypervisors and is remarkably simple. Each VM is made up of one or more vDisks, with each vDisk presented directly to the VM via iSCSI. vDisks appear to the guest OS as if they were a physical disk or the same as a VMDK does in vSphere environments and do not require any special in guest configuration.

The I/O path for each vDisk bypasses the underlying QEMU storage stack and has a direct TCP connection to the iSCSI target on the local Controller VM. This bypasses any/all queues at the hypervisor layer and allows Stargate to manage the one and only queue.

Importantly, every single vDisk has its own TCP connection to stargate which means vdisks do not share any queues until they hit the storage controller (stargate). This reduces points of contention to Stargate itself and as every AHV node runs a stargate instance (within the CVM), only VMs on the same node share the queue for stargate, further reducing the chances of contention.

For those of you who are not familiar with the underlying Nutanix architecture, check out the below video describing what stargate does.

Because the vDisk is presented as a LUN via iSCSI the commands being sent do not require SCSI protocol emulation and simply send native SCSI commands.

The below diagram shows a VM with 3 vDisks and how they connect to Stargate. You will note QEMU is completely bypassed which optimises the I/O path.

If a Virtual machine has more than 3 vDisks, each additional vDisk will have its own TCP connection.

In the event the local Stargate instance is offline for any reason (e.g.: Rolling One-Click upgrade or CVM failure) each TCP connection will be redirected in a round robin manner across all the CVMs within the Nutanix cluster as described in Acropolis Hypervisor (AHV) I/O Failover & Load Balancing.

2. Advanced Storage Performance Monitoring with Nutanix

3. Why AHV is the next generation hypervisor – 10 Part Series

Fight the FUD: Nutanix Erasure Coding Efficiency

Posted on November 15, 2015 by Josh Odgers

Every now and again you will see one vendor put out information/statements about other vendors technology. 9 times out of 10 its either outdated , incorrect or a deliberate attempt to spread Fear Uncertainty and Doubt (FUD).

Today I discovered something on LinkedIn I thought I would respond too, especially as it was mostly by two sales guys (One Sales Engineer & One Sales Director) from one vendor and two other individuals from other vendors trying to spread FUD.

Two of these vendors according to Gartner, are niche players and the other vendor didn’t even make the quadrant shown below.

Had the sales director simple googled Nutanix Erasure Coding he would have found the following articles which covers all of his questions and provides links to further articles on the topic. But hey, doing that would prevent him being able to spread FUD.

Nutanix – Erasure Coding (EC-X) Deep Dive

The above article refers to the below article which explains what data Nutanix EC-X will take effect on and discussed performance impact.

What I/O will Nutanix Erasure coding (EC-X) take effect on?

But let’s quickly address each point and correct the mis-information:

The “problems” the sales director has with the technical implementation of Nutanix EC-X are as follows, I will respond in-line.

Nutanix gets to decide if the data is hot or cold.

Not sure how this is a problem, would he prefer customers have to manually select data to be considered cold? I think the distributed file system tracking what data hasn’t been written too is a very simple, accurate and totally automated way to decide what data to apply . After all Nutanix is making infrastructure invisible, so yes, We’ll put the engineering work in so the customers can just wear the Nutanix grin. (sorry that was cheesy!)

What happens when I need that data back in production…. I can’t read it natively, so I am going to have to completely rehydrate it to read it again?

EC-X does not remove the data from production! Data which has EC-X applied is not moved to a LUN (lol!). Data remains accessible in the same way it was prior to EC-X taking effect. On read I/O data is not rehydrated, EC-X is simply a more space efficient method of storing data while proving resiliency of N+1 or N+2. EC-X and RF are applied on the same container so the data is not moved when EC-X is applied.

I still have to buy enough storage to size my environment correctly the first time around, with no dedupe,no compression, no nothing… so I’m only making my storage last a bit longer to eke a little more life out of it. It is not solving the problem!

Firstly, without stating what “the problem” is, the statement has no context and is pointless FUD. However I can confirm EC-X works in addition to compression & dedupe both of which can be in-line or post process. All three data reduction technologies also apply to both the SSD and SATA tiers, just to get in-front of any future FUD.

Nutanix recommends customers start small and scale as required since our platform scales so gracefully, but if a customer wants to size for 3-5 years up front (we would help them avoid this BTW) we make assumptions (like every vendor, BTW) as to typical data reduction savings based on the information we have about the customer workload, and size with suitable capacity for at least N+1 to enable fully automated self healing from a node failure.

I can only erasure code very certain, specific workloads. This could be a very small amount of data.

Nutanix EC-X can apply to ANY data stored on the Nutanix Distributed Storage Fabric. As per the Deep Dive post (which this guy clearly didn’t read), Nutanix chooses to apply EC-X to data which is write cold for 60 mins to avoid the inefficiencies of striping data across nodes then having to re-stripe it shortly after following a subsequent write I/O. RF2 (or RF3) is more efficient for write intensive workloads and because Nutanix understands this, we only apply EC-X to non write intensive I/O.

I have a known high overhead on Nutanix anyway, so by using erasure coding, post process, I am reducing even further the amount of resources available to VMs.

Another baseless statement, But lets talk about the amount of resources available to VMs. The CVM size does not increase when EC-X is enabled, and the fact EC-X increases the effective capacity of the SSD tier, it means more data can be served out of SSD. What this results in is lower latency for a larger working set which REDUCES the CPU WAIT for the CVM and for all VMs performing I/O. Less data being stored (up to 2x less with RF3) means less metadata needs to be maintained, so the overheads on the CVM in many ways are reduced.

If Erasure Coding is applied in-line (which BTW Nutanix can do with a simple toggle of a setting, but chooses not too), it means that for write intensive workloads, stripes need to be recalculated frequently which is a high CPU overhead compared to, in Nutanix case RF2 or RF3.

Oh did I mention with EC-X the parity data is stored in the SATA tier, freeing up the SSD tier for even more data to be served with flash performance, this is another example of the increased efficiencies when using EC-X.

I’m still only doing this on a local basis, not globally, those inefficiencies continue to abound.

Ah, just plain wrong! EC-X is applied globally across the entire cluster with only one part of any EC-X stripe per node, ensuring maximum efficiency & resiliency.

Now to reply to one of the funnier comments:

I agree with Alan. IMO, any HCI vendor that offers erasure coding is essentially saying they cannot do in-line deduplication and compression at-speed. So they have to give you an alternative to get storage efficiency using a post-process like erasure coding. However, they still take the storage performance “hit” of having to read-in all the data, perform the calculations, and write it all back out again. This reminds me of how NetApp did post-process deduplication. Customers didn’t like the performance hit, you could only run so many jobs at any given time, and dedup jobs would constantly run-over their schedule and impact the following morning’s performance. Many customers would simply forgo the deduplication process to avoid the resulting headaches.HCI vendors who can perform the data efficiency in-line & at-speed – thus bypassing the need for any kind of post-process – will have a clear advantage over their competition.

So this guy is also saying In-line is best for Erasure Coding as well as dedupe and compression. Well since Nutanix can and does in many cases recommend In-Line dedupe and compression its a bit of a moot point?

Erasure Coding on the other hand, I believe post process based on I/O profile is a more efficient way, as described in What I/O will Nutanix Erasure coding (EC-X) take effect on?

Sure there is an overhead of doing post process, but there is also an overhead on doing in-line which this guy seems to be forgetting. The overhead of in-line is 100% of the I/O suffers the overhead (since its in-line), with post-process applied only to suitable data (being write cold data) the overhead only applies to write cold data, which dramatically reduces the overheads because only the most suitable data for EC-X get processed.

If a customer had 100% Write Once Read Many data, In-line would be more efficient, and Nutanix would configure EC-X in-line. If however data is write hot for the business day, then becomes cold and read only overnight, post process would be orders of magnitude more efficient as the stripes would only be calculated once, as opposed to “N” times depending on how write intensive the data was during he day.

Long story short, In-line and Post-Process both have their use cases, in my experience, most production workloads suit post process erasure coding which is why Nutanix default is post process for write cold data >60mins.

Comparing Nutanix, a HCI distributed platform to Netapp which is a centralised non HCI filer is a bit ridiculous as what does/doesn’t work well for Netapp has nothing to do with Nutanix.

Summary:

The methods the Sales Director is using to spread completely incorrect information in an attempt to create FUD are just a little bit __________ (insert here).

I’d recommend customers/prospects ignore any comments from any vendor being made about another vendor period. If a vendor is spending there time talking about another vendor, politely ask them to leave and invite the vendor being spoken about to come and present as that technology is probably pretty good if other vendors feel the need to talk about it!

For the record, as the LinkedIn thread may “disappear” as a result of this post, the screen shots are below:

CloudXC

By Josh Odgers – VMware Certified Design Expert (VCDX) #90

Tag Archives: Nutanix

Fight the FUD: Nutanix scale limitations

Nutanix AHV I/O path efficiency

Fight the FUD: Nutanix Erasure Coding Efficiency

Share this:

Share this:

Share this: