The value of the hyperscaler + hypervisor model

Public cloud offerings from “hyperscalers” such as AWS EC2, Microsoft Azure and Google Cloud (GCP) provide a lot of value when it comes to standing up and running virtual workloads in a timely manner, along with various capabilities for creating globally resilient solutions.

All of these offerings also boast a wide range of native services which can complement or replace services running in traditional virtual machines.

As I’ve previously stated in a post from August 2022, Direct to Cloud Value – Part 1, the hyperscalers have two major advantages customers can benefit from:

  1. A well-understood architecture
  2. Global availability

Designing, deploying and maintaining “on-premises” infrastructure, on the other hand, is often far less attractive from a time to value perspective and requires significant design effort by highly qualified, experienced (and highly paid) individuals to get anywhere close to the scalability, reliability and functionality of the hyperscalers.

On-premises infrastructure may also not make sense for smaller customers/environments which lack the quantity of workloads/data to make it cost effective, so at a high level, “native” public cloud solutions are often a great choice for these customers.

The problem for many customers is they’re established businesses with a wide range of applications from numerous vendors, many of which are not easy to simply migrate to a public cloud provider.

Workload refactoring is often a time consuming and complex task which is not always able to be achieved in a timely manner, and in many cases not at all.

Customers also rarely have the luxury of starting from and/or just building a greenfield environment due to the overall cost and/or the requirement to get a return on investment (ROI) from existing infrastructure.

Customers often have the requirement to burst during peak periods, which isn’t easily achievable on-premises. They frequently need to significantly oversize their on-premises infrastructure just to support end of month, end of quarter or peak periods such as “Black Friday” for retailers.

This oversizing does help mitigate risks and deliver business outcomes, but it comes at a high cost (CAPEX).
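As a rough illustration of this trade-off, the sketch below compares the CAPEX of sizing for peak against a baseline sized for typical demand. All figures (core counts, cost per core) are assumptions for illustration only, not from any real environment.

```python
# Hypothetical illustration: sizing on-premises capacity for peak vs. average demand.
# All numbers below are assumptions, not real customer data.

avg_demand_cores = 400    # assumed typical day-to-day CPU demand
peak_demand_cores = 700   # assumed end-of-quarter / "Black Friday" peak
cost_per_core = 150.0     # assumed fully loaded CAPEX per physical core ($)

# On-premises must be purchased up front for peak (failure headroom ignored here).
onprem_cost = peak_demand_cores * cost_per_core
# A burstable model only pays the premium while peak capacity is actually needed.
baseline_cost = avg_demand_cores * cost_per_core

oversize_pct = 100 * (peak_demand_cores - avg_demand_cores) / peak_demand_cores
print(f"Capacity idle outside peak periods: {oversize_pct:.0f}%")
print(f"CAPEX sized for peak: ${onprem_cost:,.0f} vs baseline: ${baseline_cost:,.0f}")
```

With these assumed numbers, over 40% of the purchased capacity sits idle outside peak periods, which is the cost the burst model avoids.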

Enter the “Hyperscaler + Hypervisor” model.

The hyperscaler + hypervisor model is where the hyperscaler (AWS/Azure/Google) provides bare metal servers (a.k.a. instances) on which a hypervisor (for example, VMware ESXi) runs along with Virtual SAN (a.k.a. “vSAN”) to provide the entire VMware technology stack for running Virtual Machines (VMs).

Nutanix has a similar offering called “Nutanix Cloud Clusters” or “NC2” using their own hypervisor “AHV”.

Both the VMware and Nutanix offerings give customers the same look and feel they have today on-premises.

The advantages of the hyperscaler + hypervisor model are enormous from both a business and technical perspective; the following are just a few examples.

  • Ease of Migration

A migration of VMware based workloads from an existing on-premises environment can be achieved using a variety of methods, including VMware native tools such as HCX as well as third party tools from backup vendors such as Commvault.

This is achieved without the cost, complexity and delay of refactoring workloads.

  • Consistent look and feel

The hyperscaler + hypervisor options provide customers access to the same management tools they’re used to on-premises, meaning there is minimal adjustment required for IT teams.

  • Built-in Cloud exit strategy / No Cloud Vendor “Lock in”

The hypervisor layer allows customers to quickly move from one hyperscaler to another, again without refactoring, giving customers real bargaining power when negotiating commercial arrangements.

It also enables a move off public cloud back to on-premises.

  • Faster Time to value

The ability to stand up net new environments typically within a few hours gives customers the ability to respond to unexpected situations as well as new projects without the time/complexity of procurement and designing/implementing new environments from the ground up.

One very important capability here is the ability to respond to critical situations such as ransomware by standing up entirely isolated, net new infrastructure on which to restore known good data. This is virtually impossible to do on-premises.

  • Lower Risk

In the event of a significant commercial/security/technical issue, a hyperscaler + hypervisor environment can be scaled up, migrated to a new environment/provider or isolated.

This model also mitigates the delays caused by under-sizing, or by failure scenarios where new hardware needs to be added, as this can typically occur within an hour or so as opposed to days, weeks or months.

As in the next example, workloads can simply be “lifted and shifted”, minimising the number of changes/risks involved with a public cloud migration.

In the event of hardware failures, new hardware can be added back to the environment/s straight away without waiting for replacement hardware to be shipped/arrive and be installed. This greatly minimises the chance of double/subsequent failures causing an impact to the environment.

In the case of a disaster such as a region failure, a new region can be stood up to restore production, whereas standing up or scaling a new on-premises environment is unlikely to occur in a timely manner.

  • Avoiding the need to “re-factor” workloads

Simply lifting and shifting workloads “as-is” onto the same underlying hypervisor ensures the migration can occur with as few dependencies (and risks) as possible.

  • Provides excellent performance

The hardware provided by these offerings varies, but it is often all-NVMe storage paired with latest (or close to latest) generation CPUs and memory, ensuring customers are not stuck with older generation hardware.

Having all workloads share a pool of NVMe storage also avoids the issue where some instances (VMs) are assigned to a lower tier of storage due to commercial cost constraints which can have significant downstream effects on other workloads/applications.

The all NVMe option in hyperscalers + hypervisor solutions becomes cost effective due to the economies of scale and elimination of “Cloud waste” which I will discuss next.

In many cases customers will be moving from hardware and storage solutions that are several years old; simply having an all-NVMe storage layer can reduce latency and make more efficient use of CPU/memory, often resulting in significant performance improvements even before factoring in newer generation CPUs.

  • Economies of scale

In many cases, purchasing on a per instance (VM) basis may be attractive in the beginning, but once you reach a certain level of workloads it makes more sense to buy in bulk (i.e.: a bare metal instance) and run the workloads on top of a hypervisor.

This gives the customer the benefit of the hypervisor’s ability to efficiently and effectively oversubscribe CPU, and with a hyper-converged (HCI) storage layer (Virtual SAN a.k.a. vSAN, or Nutanix AOS), customers benefit from native data reduction capabilities such as compression, deduplication and erasure coding.
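As a rough sketch of why “buying in bulk” pays off, the example below shows how a bare metal node’s raw resources go further once the hypervisor oversubscribes CPU and the HCI storage layer applies data reduction. The ratios used are illustrative assumptions only; real-world oversubscription and data reduction depend entirely on the workload mix.

```python
# Hypothetical sketch: effective resources from one bare metal node once a
# hypervisor and HCI storage layer are added. All ratios are assumptions.

physical_cores = 48
cpu_oversubscription = 3.0       # assumed safe vCPU:pCore ratio for a mixed workload
usable_vcpus = physical_cores * cpu_oversubscription

raw_tb = 30.0
compression = 1.5                # assumed compression ratio
deduplication = 1.2              # assumed deduplication ratio
erasure_coding_overhead = 1.2    # assumed raw:usable overhead vs 2.0 for plain mirroring

effective_tb = raw_tb / erasure_coding_overhead * compression * deduplication
print(f"{usable_vcpus:.0f} schedulable vCPUs from {physical_cores} physical cores")
print(f"{effective_tb:.1f} TB effective capacity from {raw_tb} TB raw")
```

Under these assumed ratios, 48 cores present as 144 schedulable vCPUs and 30 TB raw yields 45 TB effective, which is the economy of scale the per-instance pricing model cannot offer.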

  • Avoids native cloud instance constraints a.k.a “Cloud waste”

Virtual Machine “right-sizing” remains one of the most under-rated tasks, yet it can deliver not only lower cost but significant performance improvements for VMs. Cloud waste occurs when workloads are forced into pre-defined instance sizes, with resources such as vCPUs or vRAM assigned to the VM but not required/used.

With the hypervisor layer, instance sizes can be customised to the exact requirements, eliminating cloud waste, which I’ve personally observed in many customer environments to be in the range of 20-40%.
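The mechanics of cloud waste can be sketched as follows: with fixed instance shapes, every VM must round up to the next size that fits, whereas a hypervisor layer allocates exact amounts. The instance shapes and VM requirements below are made-up illustrative values, not any provider’s actual catalogue.

```python
# Hypothetical sketch of "cloud waste": fixed instance shapes force round-up,
# a hypervisor layer allows exact sizing. All shapes/requirements are assumptions.

instance_shapes = [(2, 8), (4, 16), (8, 32), (16, 64)]  # (vCPU, GiB RAM) tiers

def smallest_fitting_shape(vcpu, ram):
    """Return the smallest shape satisfying both the vCPU and RAM requirement."""
    for shape in instance_shapes:
        if shape[0] >= vcpu and shape[1] >= ram:
            return shape
    return instance_shapes[-1]

# Right-sized requirements for a handful of example VMs: (vCPU, GiB RAM)
vms = [(3, 6), (5, 20), (2, 12), (9, 24)]

needed_vcpu = sum(v for v, _ in vms)
allocated_vcpu = sum(smallest_fitting_shape(v, r)[0] for v, r in vms)
waste_pct = 100 * (allocated_vcpu - needed_vcpu) / allocated_vcpu
print(f"vCPUs needed: {needed_vcpu}, allocated: {allocated_vcpu}, waste: {waste_pct:.0f}%")
```

Even in this tiny made-up fleet, the round-up effect strands roughly 40% of the allocated vCPUs, consistent with the waste range observed above.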

Credit: Steve Kaplan for coining the term “Cloud Waste”.

  • Increased Business Continuity / Disaster Recovery options

The cost/complexity involved with building business continuity and disaster recovery (BC/DR) solutions often leads to customers having to accept, and attempt to mitigate, significant risks to their businesses.

The hyperscaler + hypervisor model provides a number of options to have very cost effective BC/DR solutions including across multiple providers to mitigate against large global provider outages.

  • An OPEX commercial model

The ability to commit to a minimum monthly spend to get the most attractive rates, while retaining the flexibility to burst when required (albeit at a less attractive price), means customers don’t have to fund large CAPEX projects and can scale in a “just in time” fashion.

Cost

This sounds too good to be true, so what about cost?

On face value, these offerings can appear expensive compared to on-premises equivalents, but from the numerous assessments I’ve conducted, I am confident the true cost is comparable to, or even lower than, on-premises, especially when a proper Total Cost of Ownership (TCO) analysis is performed.

Compared with “native cloud” (i.e.: running workloads without the hypervisor layer), the hyperscaler + hypervisor solution will typically save customers 20-40% while providing equal or better performance and resiliency.

One other area which can make costs higher than necessary is a lack of optimisation with the workloads. I highly recommend for both on-premises and hyperscaler models that customers engage an experienced architect to review their environment thoroughly.

The performance benefits of a right-sizing exercise are typically significant, AND it frees up valuable IT resources (CPU/RAM). It also means less hardware is required to achieve the same or a better outcome, thereby lowering costs.

Summary

The hyperscaler + hypervisor model has many advantages both commercially and technically and with the ease of setup, migration to and scaling in public cloud, I expect this model to become extremely popular.

I would strongly recommend anyone looking at replacing their on-premises infrastructure in the near future do a thorough assessment of these offerings against their business goals.

End-2-End Enterprise Architecture (@E2EEA) has multiple highly experienced and certified staff at the highest level with both VMware (VCDX) and Nutanix (NPX) technologies and can provide expert level services to help you assess the hyperscaler + hypervisor options as well as design and deliver the solution.

E2EEA can be reached at sales@e2eea.com

IT Infrastructure Business Continuity & Disaster Recovery (BC/DR) – Corona Virus edition

Back in 2014, I wrote about Hardware support contracts & why 24×7 4 hour onsite should no longer be required. For those of you who haven’t read the article, I recommend doing so prior to reading this post.

In short, the post talked about the concept of the typical old-school requirement to have expensive 24/7, 2 or 4-hour maintenance contracts and how these become all but redundant when IT solutions are designed with appropriate levels of resiliency and have self-healing capabilities capable of meeting the business continuity requirements.

Some of the key points I made regarding hardware maintenance contracts included:

a) Vendors failing to meet SLA for onsite support.

b) Vendors failing to have the required parts available within the SLA.

c) Replacement HW being refurbished (common practice) and being faulty.

d) The more proprietary the HW, the more likely replacement parts will not be available in a timely manner.

All of these are applicable to all vendors and can significantly impact the ability to get the IT infrastructure back online or back to a resilient state where subsequent failures may be tolerated without downtime or data loss.

I thought with the current Coronavirus pandemic, it’s important to revisit this topic and see what we can do to improve the resiliency of our critical IT infrastructure and ensure business continuity no matter what the situation.

Let’s start with “Vendors failing to meet SLA for onsite support.”

At the time of writing, companies the world over are asking employees to work from home and operate on skeleton staff. This will no doubt impact vendor abilities to provide their typical levels of support.

Governments are also encouraging social distancing, asking people to isolate themselves and avoid unnecessary travel.

We would be foolish to assume this won’t impact vendor abilities to provide support, especially hardware support.

What about Vendors failing to have the required parts available within the SLA?

At the time of writing I’m seeing significantly reduced flight operations (e.g.: from the USA to Europe), which will no doubt delay parts shipments and jeopardise target service level agreements.

Regarding vendors using potentially faulty refurbished hardware (a common practice), this risk in itself isn’t increased, but if it occurs, shipment of alternative/new parts is likely to be delayed.

Lastly, infrastructure leveraging proprietary HW makes it more likely that replacement parts will not be available in a timely manner.

What are some of the options Enterprise Architects can offer their customers/employers when it comes to delivering highly resilient infrastructure to meet/exceed business continuity requirements?

Let’s start with the assumption that replacement hardware isn’t available for one week, which is likely much more realistic than same-day replacement for the majority of customers considering the current pandemic.

Business Continuity Requirement #1: Infrastructure must be able to tolerate at least one component failure and have the ability to self heal back to a resilient state where a subsequent failure can be tolerated.

By component failure, I’m talking about things like:

a) HDD/SSDs

b) Physical server/node

c) Networking device such as a switch

d) Storage controller (SAN/NAS controllers, or in the case of HCI, a node)

HDDs/SSDs have been traditionally protected by using RAID and Hot Spares, although this is becoming less common due to RAID’s inherent limitations and high impact of failure.

For physical servers/nodes, products like VMware vSphere, Microsoft Hyper-V and Nutanix AHV all have “High Availability” functions which allow virtual machines to recover onto other physical servers in a cluster in the event of a physical server failure.

For networking, typically leaf/spine topologies provide a sufficient level of protection with a minimum of dual connections to all devices. Depending on the criticality of the environment, quad connections may be considered/required.

Lastly, with Storage Controllers, traditional dual controller SAN/NAS have a serious resiliency constraint in that they require HW replacement to restore resiliency. This is one reason why Hyper-Converged Infrastructure (a.k.a. HCI) has become so popular: some HCI products can tolerate multiple storage controller failures and continue to function and self-heal thanks to their distributed/clustered architecture.

So with these things in mind, how do we meet our Business Continuity Requirement?

Disclaimer: I work for Nutanix, a company that provides Hyper-Converged Infrastructure (HCI), so I’ll be using this technology as my example of how resilient infrastructure can be designed. With that said the article and the key points I highlight are conceptual and can be applied to any environment regardless of vendor.

For example, Nutanix uses a Scale Out Shared Nothing architecture to deliver highly resilient, self-healing capabilities. In the referenced example, a small Nutanix cluster of just 5 nodes suffers a physical server failure, self-heals both the CPU/RAM and storage layers back to a fully resilient state, and then tolerates a further physical server failure.

After the second physical server failure, it’s critical to note the Nutanix environment has self healed back to a fully resilient state and has the ability to tolerate another physical server failure.

In fact the environment has lost 40% of its infrastructure and Nutanix still maintains data integrity & resiliency. If a third physical server failed, the environment would continue to function maintaining data integrity, though it may not be able to tolerate a subsequent disk failure without data becoming unavailable.

So in this simple example of a small 5-node Nutanix environment, up to 60% of the physical servers can be lost and the business would continue to function.
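The self-heal behaviour described above can be sketched with a toy model: a 5-node cluster keeping two copies of each piece of data, where surviving nodes absorb a failed node’s data to restore full redundancy. The utilisation figure is an assumption, and real rebuild behaviour depends on free capacity, configuration and the specific product, so treat this purely as a conceptual illustration.

```python
# Hypothetical sketch of scale-out self-healing in a 5-node cluster with two
# data copies (RF2-style). Utilisation is assumed; real behaviour varies.

nodes = 5
used_pct_per_node = 0.50   # assumed storage utilisation before any failure
min_nodes_for_two_copies = 3  # two copies need at least three nodes to re-protect

surviving = nodes
while surviving > min_nodes_for_two_copies:
    surviving -= 1  # a node fails
    # Surviving nodes absorb the failed node's data to restore two copies.
    used_pct_per_node *= (surviving + 1) / surviving
    healed = used_pct_per_node <= 1.0
    print(f"{surviving} nodes left, ~{used_pct_per_node:.0%} used, self-healed: {healed}")
```

In this toy model the cluster re-protects itself after the first and second node failures (provided enough free capacity exists), matching the behaviour described above: beyond that point, data remains intact but further failures can no longer be healed to a fully resilient state.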

With all these component failures, it’s important to note the Nutanix platform self healing was completed without any human intervention.

For those who want more technical detail, check out my post which shows Nutanix Node (server) failure rebuild performance.

From a business perspective, a Nutanix environment can be designed so that the infrastructure can self heal from a node failure in minutes, not hours or days. The platform’s ability to self heal in a timely manner is critical to reduce the risk of a subsequent failure causing downtime or data loss.

Key Point: The ability for infrastructure to self heal back to a fully resilient state following one or more failures WITHOUT human intervention or hardware replacement should be a firm requirement for any new or upgraded infrastructure.

So the good news for Nutanix customers is that during this pandemic or future events, assuming the infrastructure has been designed to tolerate one or more failures and self-heal, the potential (if not likely) delay in hardware replacement is unlikely to impact business continuity.

For those of you who are concerned, after reading this, that your infrastructure may not provide the business continuity you require, I recommend you get in touch with the vendor/s who supplied the infrastructure and document the failure scenarios, the impact each has on the environment, and how the solution recovers back to a fully resilient state.

Worst case, you’ll identify gaps which need attention; think of this as a good thing, because the process may identify issues you can resolve proactively.

Pro Tip: Where possible, choose a standard platform for all workloads.

As discussed in “Things to consider when choosing infrastructure”, choosing a standard platform to support all workloads can have major advantages such as:

  1. Reduced silos
  2. Increased infrastructure utilisation (due to reduced fragmentation of resources)
  3. Reduced operational risk/complexity (due to fewer components)
  4. Reduced OPEX
  5. Reduced CAPEX

The article summarises by stating:

“if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you’re doing yourself (or your customer) a dis-service by not considering using a standard platform for your workloads.”

What are some of the key factors to improve business continuity?

  1. Keep it simple (stupid!) and avoid silos of bespoke infrastructure where possible.
  2. Design BEFORE purchasing hardware.
  3. Document BUSINESS requirements AND technical requirements.
  4. Map the technical solution back to the business requirements i.e.: How does each design decision help achieve the business objective/s.
  5. Document risks and how the solution mitigates & responds to the risks.
  6. Perform operational verification i.e.: Validate the solution works as designed/assumed & perform this testing after initial implementation & maintenance/change windows.

Considerations for CIOs / IT Management:

  1. The cost of performance degradation, such as reduced sales transactions per minute and/or reduced employee productivity/morale
  2. The cost of downtime, e.g.: a total outage of IT systems, including lost revenue and damage to your brand
  3. The cost of increased resiliency compared to points 1 & 2
    1. I.e.: It’s often much cheaper to implement a more resilient solution than to suffer even a single outage annually
  4. How employees can work from home and continue to be productive

Here are a few things to ask of your architect/s when designing infrastructure:

  1. Document failure scenarios and the impact to the infrastructure.
  2. Document how the environment can be upgraded to provide higher levels of resiliency.
  3. Document the Recovery Time (RTO) and Recovery Point Objectives (RPO) and how the environment meets/exceeds these.
  4. Document under what circumstances the environment may/will NOT meet the desired RPO/RTOs.
  5. Design & Document a “Scalable and repeatable model” which allows the environment to be scaled without major re-design or infrastructure replacement to cater for unforeseen workload (e.g.: Such as a sudden increase in employees working from home).
  6. Avoid creating unnecessary silos of dissimilar infrastructure

Related Articles:

  1. Scale Out Shared Nothing Architecture Resiliency by Nutanix
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.
  3. Nutanix | Scalability, Resiliency & Performance | Index
  4. Nutanix vs VSAN / VxRAIL Comparison Series
  5. How to Architect a VSA , Nutanix or VSAN solution for >=N+1 availability.
  6. Enterprise Architecture and avoiding tunnel vision

Solving Oracle & SQL Licensing challenges with Nutanix

The Nutanix platform has and will continue to evolve to meet/exceed the ever increasing customer and application requirements while working within constraints such as licensing.

Two of the most common workloads which I work frequently with customers to design solutions around real or perceived licensing constraints are Oracle and SQL.

In years gone by, Nutanix solutions were constrained to a limited number of node types. When I joined in 2013 only one model existed (the NX-3450), which limited customers’ flexibility and often led to paying more for licensing than a traditional 3-tier solution.

With that said, the ROI and TCO for Nutanix solutions back then were still more often than not favourable compared to 3-tier, and these days there is only more good news for prospective and existing customers.

Nutanix has now rounded out the portfolio with the introduction of “Compute Only” nodes to target a select few niche workloads with real or perceived licensing and/or political constraints.

Compute Only nodes complement the traditional HCI nodes (Compute + Storage) as well as Nutanix’s unique Storage Only nodes, which were introduced in mid 2015.

So how do Compute Only nodes help solve these licensing challenges?

In short, Oracle leads the world in misleading and intimidating customers into paying more for licensing than they need to. One of the most ridiculous claims is: “You must license every physical CPU core in your cluster because Oracle could run, or could have run, on it.”

The below tweet makes fun of Oracle and shows how ridiculous their claim is that customers need to license every node in a cluster (a claim I’ve never seen referenced in any actual contract).

So let’s get to how you can design a Nutanix solution to meet a typical Oracle customer licensing constraint while ensuring excellent Scalability, Resiliency and Performance.

At this stage, let’s assume you’ve given your first born child and left leg to Oracle and have subsequently been granted, for example, 24 physical core licenses. What next?

If we were to use HCI nodes, some of the CPU would be utilised by the Nutanix Controller VM (CVM), and while the CVM does add a lot of value (see my post Cost vs Reward for the Nutanix Controller VM), you may be so constrained by licensing that you want to maximise the CPU power available to just the Oracle workloads.

In this example we have 24 licensed physical cores, so we could use two Compute Only nodes, each with dual Intel Xeon Gold 6128 processors (6 cores / 3.4GHz), giving 12 cores per server and 24 physical cores in total.
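The arithmetic behind this sizing is simple enough to sanity-check; the sketch below follows the article’s framing of licensing per physical core (Oracle’s actual processor licensing rules are more involved and should be verified against your contract).

```python
# Sanity check of the licensing arithmetic above: two dual-socket Compute Only
# nodes with Intel Xeon Gold 6128 CPUs (6 cores each) consume exactly the
# 24 physical cores in the example. Framing per physical core per the article.

servers = 2
sockets_per_server = 2
cores_per_socket = 6  # Intel Xeon Gold 6128

cores_to_license = servers * sockets_per_server * cores_per_socket
print(f"Physical cores requiring licenses: {cores_to_license}")
```

Any alternative CPU choice can be checked the same way, keeping total cores at or below the licensed count.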

Next we would assess the storage capacity, resiliency and performance requirements and decide how many and what configuration storage only nodes are required.

Because Virtual Machines cannot run on Storage Only nodes, the Oracle Virtual Machines can never run on any CPU cores other than those in the two Compute Only nodes, and you would therefore be in compliance with your licensing.

The below is an example of what the environment could look like.

[Diagram: a cluster of two Compute Only nodes plus four Storage Only nodes]

Microsoft SQL Server has ever-changing licensing models, which in some cases license by server or vCPU count; Compute Only nodes can be used in the same way as explained above to address any SQL licensing constraints.

What about if I need to scale storage capacity and/or performance?

You’re in luck: without any modifications to the Oracle workloads, you can simply add one or more Storage Only nodes to the cluster and almost immediately increase capacity, performance and resiliency!

I’ve published an example of the performance improvement by adding storage only nodes to a cluster in an article titled Scale out performance testing with Nutanix Storage Only Nodes which I wrote back in 2016.

In short, the results show that by doubling the number of nodes from 4 to 8, performance almost exactly doubled while still delivering low read and write latency.

What if you’ve already invested in Nutanix HCI nodes (example below) and are running Oracle/SQL or any other workloads on the cluster?

[Diagram: a typical HCI cluster]

Nutanix provides the ability to convert an HCI node into a Storage Only node, which prevents Virtual Machines from running on that node. So all you need to do is add two or more Compute Only nodes to the cluster, then mark the existing HCI nodes as Storage Only; the result is shown below.

[Diagram: two Compute Only nodes plus the converted (Storage Only) HCI nodes]

This is in fact the minimum supported configuration for Compute Only environments, to ensure minimum levels of resiliency and performance. For more information, check out my post “Nutanix Compute Only Minimum Requirements”.

Now we have two nodes (Compute Only) which can run Virtual Machines and four nodes (HCI nodes converted to Storage Only) which are servicing the storage I/O. In this scenario, if the HCI nodes have unused CPU and/or RAM the Nutanix Controller VM (CVM) can also be scaled up to drive higher performance & lower latency.

Compute Only is currently available with the Nutanix Next Generation Hypervisor “AHV”.

Now let’s cover off a few of the benefits of running applications like Oracle & SQL on Nutanix:

  1. No additional Virtualization licensing (AHV is included when purchasing Nutanix AOS)
  2. No rip and replace for existing HCI investment
  3. Unique scale out distributed storage fabric (ADSF) which can be easily scaled as required
  4. Storage Only nodes add capacity, performance and resiliency to your mission critical workloads without incurring additional hypervisor or application licensing costs
  5. Compute Only allows scale up and out of CPU/RAM resources where applications are constrained by ONLY CPU/RAM and/or application software licensing.
  6. Storage Only nodes can also provide functions such as Nutanix Files (previously known as Acropolis File Services or AFS)

With Nutanix now offering HCI, Storage Only and Compute Only nodes, we’re entering a time where Nutanix can truly be the standard platform for almost any workload, including those with non-technical constraints such as politics or application licensing, which have traditionally been at least perceived as an advantage for legacy SAN products.

The beauty of the Nutanix examples above is while they look like a traditional 3-tier, we avoid the legacy SAN problems including:

1. Rip and Replace / High Impact / High Risk Controller upgrades/scalability
2. Difficulty in scaling performance with capacity
3. Inability to increase resiliency without adding additional Silos of storage (i.e.: Another dual controller SAN)

With Compute Only supported on AHV, we also help customers avoid the unnecessary complexity and related operational costs of managing ESXi deployments, which have become increasingly complex over time without delivering significantly more value to the average customer who simply wants a high performance, resilient and easy to manage virtualisation solution.

But what about VMware ESXi customers?

Obviously moving to AHV would be ideal, but those who cannot, for whatever reason, can still benefit from Storage Only nodes, which provide increased storage performance and resiliency to the Virtual Machines running on ESXi.

Customers can run ESXi on Nutanix (or OEM / Software Only) HCI nodes and then scale the cluster’s performance/capacity with AHV based Storage Only nodes, eliminating the need to license both ESXi and Oracle/SQL for those nodes, since no Virtual Machine will run on them.

How does Nutanix compare to a leading all flash array?

For those of you who would like to see an HCI-only Nutanix solution deliver better TCO, performance and capacity than a leading All Flash Array, check out A TCO Analysis of Pure FlashStack & Nutanix Enterprise Cloud, where even after giving every possible advantage to Pure Storage, Nutanix still comes out on top without any data reduction assumptions.

Now consider that the Nutanix TCO, performance and capacity were better than a leading All Flash Array using HCI nodes alone, and imagine the increased efficiency and flexibility gained by being able to mix and match HCI, Storage Only and Compute Only nodes.

This is just another example of how Nutanix is eliminating even the corner use cases for traditional SAN/NAS.

For more information about Nutanix Scalability, Resiliency and Performance, check out this multi-part blog series.