Think HCI is not an ideal way to run your mission-critical x86 workloads? Think again! – Part 2

Now continuing from Part 1, lets look at another one of VCE COO Todd Pavone’s statements from the COO: VCE converged infrastructure not affected by Dell-EMC article:

We believe that there was a major gap in the core data center for hyper-converged, where customers wanted hyper-converged architecture — they don’t want to invest in tier-one storage or tier-one servers. They want the intelligence in the software, but they also want massive scale. This is for globals, large service providers in a massive scale, like thousands of nodes. We have a large financial service company in New York that is using us for a platform-free application build-up. And they want to pilot it with 10,000 users, but it’s going to go to 10 million users. And so, can we give them an infrastructure for 10,000, but can scale simply and easily to 10 million — or 20 million?

You can’t do that on an appliance, right? But they want hyper-converged. When you get to 10 million users, you want an infrastructure that scales and is nonlinear, leading to a lower cost model. So, we said, “There’s a gap in that market,” and we created the rack.

Let’s again address these points:

  • Todd: “They don’t want to invest in tier-one storage or tier-one servers. They want the intelligence in the software, but they also want massive scale.”

If customers don’t want to invest in what I would call “traditional” tier one storage and servers, them I’d have to agree with them they need a very different solution, such as Nutanix if they want to get to massive scale, especially if they want easy management & deployment.

Nutanix has customers ranging from 3 to thousands of nodes, in fact many of our large customers run Acropolis Hypervisor. So any question about scalability for Nutanix is just laughable.

  • Todd: “And they want to pilot it with 10,000 users, but it’s going to go to 10 million users. And so, can we give them an infrastructure for 10,000, but can scale simply and easily to 10 million — or 20 million? You can’t do that on an appliance, right?”

Well, you can with Nutanix! In fact that sounds like a common use case for Nutanix, we frequently design and pilot repeatable models and then scale as required.

  • Todd: “But they want hyper-converged. When you get to 10 million users, you want an infrastructure that scales and is nonlinear, leading to a lower cost model. So, we said, “There’s a gap in that market,” and we created the rack.”

It’s no surprise to me at all that customers want Hyperconverged and the ability to scale both linearly and non linearly. Nutanix can do this today and has been able to do it for a long time. Back in 2013 for example, you could mix NX3000 series being Compute heavy / Storage Light with NX6000 nodes which are Compute light and Storage Heavy. This is an example of non linear scaling which achieves the reduced cost (e.g.: Cost/GB) over time.

Then in 2014 an even wider range of nodes were released (NX1000, NX3000, NX6000 & NX8000) which enhanced Nutanix ability to scale both up and out, linearly and non linearly.

In 2015 Nutanix launched the NX-6035C “Storage Only” node which allows customers to Scale Storage separately to Compute, ensuring non linear scaling compute vs storage for customers with high capacity requirements. Importantly, no hypervisor licensing is required to scale storage as storage only nodes run Acropolis Hypervisor (AHV) which is fully interoperable with ESXi and Hyper-V environments.

Remember the Rule of thumb: Don’t scale capacity without scaling storage controllers!

Nutanix Storage Only nodes run a light weight Controller VM (CVM) to ensure Management, Monitoring and Data services (e.g.: Disk Balancing, Compression, Dedupe, Erasure Coding etc) do not degrade even when scaling compute and storage in a vastly non linear manner. Storage only nodes also help improve performance by participating in cluster replication (RF2/RF3) and disk balancing activities.

  • Todd: “So, we said, “There’s a gap in that market,” and we created the rack.”

There may have been a gap back in early 2013, but since then Nutanix has continued to innovate and lead the market with solutions to scale both linearly and non linearly, I’d say the gap has long been filled. Nutanix also scales management with a single HTML 5 GUI called PRISM, with central management of multiple clusters/sites/geographical locations via PRISM central.

Summary:

I’m sure it’s pretty obvious by now VCE COO Todd Pavone and I have different opinions on what HCI is capable of. During my time at Nutanix I have seen countless successful small, medium and large scale mission-critical application deployments and the percentage of Nutanix business from these workloads continues to increase thanks to our investment in a dedicated vBCA team which I am fortunate to be a part of.

Next time you’re considering new infrastructure for mission critical application, reach out and I’ll happily work with you and see if Nutanix is a good fit for your use case.

Let me finish by saying, I can guarantee you that if in the unlikely event the workload/s are not suitable for Nutanix, I will be the first one to tell you, and help you find an alternate solution.

Back to Part 1.

Think HCI is not an ideal way to run your mission-critical x86 workloads? Think again! – Part 1

I recently wrote a post called Fight the FUD: Nutanix scale limitations which corrected some mis-information VCE COO Todd Pavone has stated in this article COO: VCE converged infrastructure not affected by Dell-EMC about Nutanix scalability.

In the same interview, Todd makes several comments ( see quote below) which I can only trust to be accurate for VSPEX Blue but as he refers more generally about Hyper-converged systems, I have to disagree with many of the comments from a Nutanix perspective, and thought it would be good to discuss where I see Nutanix.

Where does VSPEX Blue fit into the portfolio?

Hyper-converged by definition is where you use software to find technology to manage what people like to call a commoditized infrastructure, where there is no external storage. So, the intelligence is in the software, and you don’t require the intelligence in the infrastructure. In the market, everyone has had an appliance, which is just a server with embedded storage or some marketed software, and ideal for edge locations or for single use cases. But you’re not going to put SAP and run your mission-critical business on an appliance. They have scaling challenges, right? You get to a certain number of nodes, and then the performance degrades; you have to then create another cluster, another cluster. It’s just not an ideal way to go run your mission-critical x86 workloads. [It’s] good for an edge, good for a simple form factors, good for single use cases or what I’ll call more simplified workloads.

In this post I will be specifically discussing Nutanix HCI solution, and while I have experience with and opinions about other products in the market, I will let other vendors speak for themselves.

The following quotes are not in the order Todd mentioned them in the above interview, they have been grouped together/ordered to avoid overlap/repeating comments and to make this blog flow better (hopefully). As such, if any comments appear to be taken out of context, it is not my intention.

So let’s break down what Todd has said:

  • Todd: In the market, everyone has had an appliance, which is just a server with embedded storage or some marketed software, and ideal for edge locations or for single use cases.

I agree that Hyper-converged systems such as Nutanix run on commodity servers with embedded storage. I also agree Nutanix is ideal for edge locations and can be successfully used for single use cases, but as my next response will show, I strongly disagree with any implication that Nutanix (as the markets most innovative leader in HCI, source: Gartner with 52% market share according to IDC) is limited to edge or single use cases.

  • Todd: “It’s just not an ideal way to go run your mission-critical x86 workloads” & “But you’re not going to put SAP and run your mission-critical business on an appliance.”

Interestingly, Nutanix is the only certified HCI platform for SAP.

As an architect, when designing for mission critical workloads, I want a platform which can/is:

a) Start small and scale as required (for example as vBCA’s demands increase)
b) Highly resilient & have automated self healing
c) Fully automated non-disruptive (and low impact) maintenance
d) Easy to manage / scale
e) Deliver the required levels of performance

In addition to the above, the fewer dependancies the better, as there is less to go wrong, troubleshoot, create bottlenecks and so on.

Nutanix HCI delivers all of the above, so why wouldn’t you run vBCA on Nutanix? In fact, the question I would ask is, “Why would you run vBCA on legacy 3 tier platforms”!

With legacy 3 tier in my experience it’s more difficult to start small and scale, typically 3-tier solutions have only two controllers which cannot self heal in the event of a failure, have complex and time consuming patching/upgrading procedures, typically have multiple points of Management (not single pane of glass like Nutanix w/ Acropolis Hypervisor), are typically much more difficult to scale (and require rip/replace).

The only thing most monolithic 3-tier products provide (if architected correctly) is reasonable performance.

Here is a typical example of a Nutanix customer upgrade experience compared to a legacy 3-tier product.

HdexTweetUpgrades

Think the above isn’t a fair comparison? I agree! Nutanix vs Legacy is no contest.

When I joined Nutanix in 2013, I was immediately involved with testing of mission critical workloads & I have no problems saying performance was not good enough for some workloads. Since then Nutanix has focused on building out a large team (3 of which are VCDX with years of vBCA experience) focusing on business critical applications, now applications like SQL, Oracle (including RAC deployments), MS Exchange and SAP are becoming common workloads for our customers who originally started with Test/Dev or VDI.

Think of Nutanix like VMware in 2005, everyone was concerned about performance, resiliency and didn’t run business critical applications on VI3 (later renamed vSphere), but over time everyone (including myself) learned virtualization was infact not only suitable for vBCA it’s an ideal platform. I’m here to tell everyone, don’t make the same mistake (we all did with virtualization) and assume Nutanix isn’t suitable for vBCA and wait 5 years to realise the value. Nutanix is more than ready (and has been for a while) for Mission critical applications.

Regarding Todd’s second statement “But you’re not going to put SAP and run your mission-critical business on an appliance.”

If not on an appliance, then what are we supposed to put mission-critical application on? Regardless of what you think of traditional Converged products, the fact is they are actually just a single SKU for multiple different pre-existing products (generally from multiple different vendors) which have been pre-architected and configured. They are not radically different and nor do they eliminate ongoing operational complexity which is a strength of HCI solutions such as Nutanix.

If anything putting mission critical applications on a simple and highly performant/scalable HCI appliance based solution (especially Nutanix) makes more sense than Converged / 3 Tier products. Nutanix is no longer the new kid on the block, Nutanix is well proven across all industries and on different workloads, including mission critical. Hell, most US Federal agencies including the Pentagon uses Nutanix, how much more critical do you want?  (Also anyone saying VDI isn’t mission critical has rock’s in their head! Think if all your users are offline, how productive is your company and how much use are all your servers?)

Imagine if the sizing of a traditional converged solution is wrong, or a mission critical application outgrows it before its scheduled end of life. Well with Nutanix, add one or more nodes (no rip and replace) and vMotion the workload/s, and you’ve scaled completely non disruptively. In fact, with Nutanix you should intentionally start small and scale as close to a just in time fashion as possible so your mission-critical application can take advantage of newer HW over the 3-5 years! Lower CAPEX and better long term performance, sounds like a WIN/WIN to me!

Even if it were true that Converged (or any other product) had higher peak performance (which in the real world has minimal value) than a Nutanix HCI solution, so what? Do you really want to have point solutions (a.k.a Silos) for every different workload? No. I wrote the following post which covers things to consider when choosing infrastructure which covers why you want to avoid silos which I encourage you to read when considering any new infrastructure.

  • Todd: They have scaling challenges, right? You get to a certain number of nodes, and then the performance degrades; you have to then create another cluster, another cluster.”

My previous post Fight the FUD: Nutanix scale limitations covers this FUD off in detail. In short, Nutanix has proven numerous times we can scale linearly, see Scaling to 1 Million IOPS and beyond linearly! for an example (And this video is from October 2013). Note: Ignore the actual IO number, the importaint factor is the linear scalability, not the peak benchmark number which have little value in the real world as I discuss here: “Peak Performance vs Real World Performance”.

  • Todd:  [It’s] good for an edge, good for a simple form factors, good for single use cases or what I’ll call more simplified workloads.

To be honest i’m not sure what he means by “good for a simple form factors”, but I can only assume he is talking about how HCI solutions like Nutanix has compact 4 node per 2RU form factors and use less rack space, power, cooling etc?

As for single use cases, I recommend customers run mixed workloads for several reasons. Firstly, Nutanix is a truly distributed solution which means the more nodes in a cluster, the more performant & resilient the cluster becomes. Scaling out a cluster also helps eliminate silos which reduces waste.

I recently wrote this post: Heterogeneous Nutanix Clusters Advantages & Considerations which covers how mixing node types works in a Nutanix environment. The Nutanix Distributed Storage fabric has lots of back end optimisations (ran by curator) which have been developed over the years to ensure heterogeneous clusters perform well. This is an example of technology which marketing slides can’t represent the value of, but the real world value is huge.

I have been involved with numerous mission critical application deployments, and there are heaps of case studies available on the Nutanix website for these deployments available at http://www.nutanix.com/resources/case-studies/.

A final thought for Part 1, with Nutanix, you can build what you need today and have mission critical workloads benefit from latest generation HW on a frequent basis (e.g.: Annually) by adding new nodes over time and simply vMotioning mission critical VMs to the newer nodes. So over say a 5 year life span of infrastructure, your mission critical applications could benefit from the performance improvements of 5 generations of intel chipsets not to mention the ever increasing efficiency of the Nutanix Acropolis base software (formally known as NOS).

Try getting that level of flexibility/performance improvements with legacy 3 tier!

Next up, Part 2

 

Heterogeneous Nutanix Clusters Advantages & Considerations

Lets start with a simple example, the below shows a 4 node cluster mixing 2 x NX-3060 nodes with 2 x NX-8035 nodes. Both node types share the same Haswell CPU types but the NX-3060 has ~2TB usable and the NX-8035 has ~8TB usable.

3060and8035Mixed

Assuming the cluster capacity was 50% utilized the NDSF layer would look similar to this:

3060and8035cluster50percentused

The above shows the NDSF having a total Storage Pool capacity of 20TB with 50% used (10TB). As we have a heterogenous cluster, we have 2 different node types with vastly different usable capacity.

Nutanix Disk Balancing automatically balances the storage to ensure the utilization percentage of all SSDs/HDDs within the cluster are within +-15%. This means administrators do not have to worry about capacity management on a per node basis, capacity management only needs to be performed at the storage pool (cluster) layer.

Advantage 1: No silos of storage capacity is heterogeneous environments

Advantage 2: NDSF disk balancing ensures the data is evenly distributed throughout the cluster

Advantage 3: There is no requirement for hypervisor level storage capacity management such as Storage DRS (SDRS).

For more information on why Storage DRS is not required see: Storage DRS and Nutanix – To use, or not to use, that is the question?

In a heterogeneous environment, it is likely you will have multiple workloads with different capacity and performance requirements. The below diagram shows the same 4 node cluster, with a single storage pool and 4 containers with different data protection and reduction settings to suit a wide range of application requirements.

Note: The RF3 container shown below would only be possible in clusters of 5 nodes or more, but is shown to illustrate the flexibility/capabilities of NDSF.

HetroClusterCapacity

The storage pool itself has up to 20TB usable (assuming RF2 and excluding data reduction savings). In the Pool we can see four Containers which can be thought of as policies which can be applied to Virtual Machines or Virtual Disks.

Container01 is configured with RF2 and In-Line compression and reports 10TB free space as the underlying storage pool (where capacity is managed) is 50% utilised. Therefore the Container reports free space as all the available capacity within the Storage Pool based on its configured RF.

Container02 has RF2, In-Line compression and EC-X enabled but you will note it also reports 10Tb free space, as capacity is not assigned to a container, its shared between all containers within a Storage Pool.

Container03 is configured with a RF3 which is different to Containers 01 and 02, as such the container reports free space based on its configured RF of 3, so it shows 13.3TB usable and 6.66Tb free space as that is the maximum data that can be supported in that container based on its storage policies.

Container04 reports the same free space as Container 01 and 02, as its configured with the same RF. While Container04 has all data reduction technologies enabled, the Container reports actual free space, as data reduction takes effect the usable capacity will change.

It is possible to set capacity reservations on Containers where an application or tenant requires a guarantee as to the usable capacity available, it is also possible to set limits on containers to prevent workloads using more than a specified amount of capacity. However, for most use cases, I recommend not using Reservations or Limits and simply manage capacity at the Storage Pool layer.

Nutanix also supports VMs with more assigned/used capacity than the node they are running on, for more information see: What if my VMs storage exceeds the capacity of a Nutanix node?

Regardless of what node type/s reside within a Nutanix cluster, there is no advanced settings required to be configured such as Queue Depths, VAAI and multi-pathing, which can be required when mixing legacy storage platforms in the same cluster. There is also no requirement for Storage DRS to manage either performance or capacity as discussed earlier.

Advantage 3: No silos of storage capacity, all capacity is shared in the storage pool

Advantage 4: Storage policies such as RF and Data Reduction can be changed on the fly as required and multiple policies are supported within the same cluster.

For more information about Nutanix data reduction technologies, see: Nutanix Implementation of Data Avoidance & Reduction Technologies

Regardless of the mixture of node types and their respective capacity/performance characteristics, there is no advanced configuration required to achieve optimal performance.

Nutanix automatically manages I/O pathing and as data locality ensures most data is read locally and writes are always written local to the VM and then replicas distributed throughout the cluster, it minimizes the chances of hot spots by default.

In the unlikely event one nodes local SSD tier becomes saturated, NDSF will automatically write data across the shared SSD tier until the local nodes SSD tier has sufficient capacity to resume local writes. This avoids the requirement for a storage admin to take any corrective actions.

Advantage 5: In the unlikely event of saturation of a nodes SSD tier, NDSF automatically redirects new I/O until ILM (tiering) can free up capacity within the local tier.

NDSF natively distributes writes throughout all nodes within the cluster. This means all nodes within heterogeneous clusters increase the capacity, performance and resiliency of the entire cluster.

To increase the performance of a single VM, you have numerous options. All you need to do is migrate (vMotion for ESXi, Live Migration for Hyper-V or Migrate for AHV) to a node with higher spec physical processors, more SSD drives and/or more SATA spindles.

There is no requirement to Storage vMotion, or relocate the VM to a new Datastore/Container. NDSF manages the storage layer automatically and will localize hot data if/when required.

Advantage 6: No silos of storage capacity, all capacity is shared in the storage pool

Advantage 7: All nodes contribute to the capacity, performance and resiliency of the cluster

Heterogeneous clusters are managed by a single HTML 5 GUI called PRISM. There is no need to access multiple management interfaces for different storage types.

Advantage 8: Heterogeneous clusters are managed via a single HTML 5 GUI.

Nutanix also supports Pin to SSD which allows workloads requiring all flash to reside within a hybrid (SSD+SATA) cluster and be guaranteed all flash performance.

VMs or Virtual Disks can also be marked to be stored solely in Flash on the fly if/when required and vice versa.

Advantage 9: No silos required for workloads requiring All Flash performance

Nutanix eliminates the complexity around managing performance at a datastore layer. Nutanix supports up to the chosen hypervisors limits, e.g.: vSphere HA limit is 2048 VMs per datastore. As all controllers within a cluster actively service all datastores (Containers), performance isn’t constrained at a datastore layer like with traditional storage products.

For more information see: Unlimited VMs per datastore? Its not a myth with Nutanix!

Advantage 10: No performance concerns/constraints at the datastore level

What about Considerations for Heterogeneous Clusters?

From a performance perspective, always ensure you size to have your N+x (e.g.: N+1 , N+2 etc) node/s sized >= the largest node in the cluster to ensure in the event of a node failure, workloads benefiting from higher performance nodes can failover to equivalent nodes.

From a capacity perspective, for NDSF to be able to restore the configured RF (RF2 or RF3) in the event of a node failure, sufficient capacity must exist within the storage pool. As such, when using high capacity nodes such as NX-8035s , NX-8150s or NX-6035C storage only nodes, ensure you have >= capacity of the largest node free within the storage pool.

Advantage 11: Performance and availability sizing for heterogeneous clusters is simple.

Another consideration is for mission-critical or high I/O applications, spread these evenly across the nodes and ideally ensure the active working set fits within the local SSD tier. Doing so will maximise performance, but in the event a very large workload cannot fit with the local SSD, its data will resided within the shared SSD tier and be actively serviced by multiple Controller VMs.

For more information about sizing see:  Rule of Thumb: Sizing for Storage Performance in the new world.

Advantage 12: The NDSF shared SSD tier ensures in the event a workload exceeds the local SSD capacity that the application still enjoys all flash performance by distributing data intelligently across the cluster.

Over time, when adding new nodes, VMs can be quickly/easily migrated to newer, higher performance/capacity nodes without any preparation. The VMs will immediately benefit from the newer nodes CPU,RAM and storage performance even if most of its data is still stored on older node types.

Older nodes can be non disruptively removed once they are end of life, again without any preparation or administrator intevenston.

Advantage 13: Workloads on NDSF benefit from newer generation nodes immediately without complex design/migration activities.

Summary:

  • Nutanix supports and recommends heterogeneous clusters
  • No complexity with multi-pathing, it’s optimal out of the box
  • No custom per datastore configuration
  • VAAI just works, no advanced configuration required due to mixed node types
  • No compromise required to mix node types
  • No silos of storage capacity, all capacity is shared in the storage pool
  • All nodes contribute to performance of the cluster
  • No balancing VMs across datastores/storage devices is required to improve performance/resiliency
  • NDSF disk balancing ensures the data is evenly distributed throughout the cluster helping avoid hotspots
  • The distribution of RF traffic throughout the cluster also helps avoid hotspots
  • No silos required for workloads requiring all flash performance
  • NDSF ensures VMs can immediately benefit from the addition of newer generation node types
  • Nodes can be added/removed without system administrator performing data migrations