Scale Out Shared Nothing Architecture Resiliency by Nutanix

At VMware vForum Sydney this week I presented “Taking vSphere to the next level with converged infrastructure”.

Firstly, I wanted to thank everyone who attended the session, it was a great turnout and during the Q&A there were a ton of great questions.

I got a lot of feedback at the session and when meeting people at vForum about how the Nutanix scale out shared nothing architecture tolerates failures.

I thought I would summarize this capability as I believe its quite impressive and should put everyone’s mind at ease when moving to this kind of architecture.

So lets take a look at a 5 node Nutanix cluster, and for this example, we have one running VM. The VM has all its data locally, represented by the “A” , “B” and “C” and this data is also distributed across the Nutanix cluster to provide data protection / resiliency etc.

Nutanix5NodeCluster

So, what happens when an ESXi host failure, which results in the Nutanix Controller VM (CVM) going offline and the storage which is locally connected to the Nutanix CVM being unavailable?

Firstly, VMware HA restarts the VM onto another ESXi host in the vSphere Cluster and it runs as normal, accessing data both locally where it is available (in this case, the “A” data is local) and remotely (if required) to get data “B” and “C”.

Nutanix5nodecluster1failed

Secondly, when data which is not local (in this example “B” and “C”) is accessed via other Nutanix CVMs in the cluster, it will be “localized” onto the host where the VM resides for faster future access.

It is importaint to note, if data which is not local is not accessed by the VM, it will remain remote, as there is no benefit in relocating it and this reduces the workload on the network and cluster.

The end result is the VM restarts the same as it would using traditional storage, then the Nutanix cluster “curator” detects if any data only has one copy, and replicates the required data throughout the cluster to ensure full resiliency.

The cluster will then look like a fully functioning 4 node cluster as show below.

5NodeCluster1FailedRebuild

The process of repairing the cluster from a failure is commonly incorrectly compared to a RAID pack rebuild. With a raid rebuild, a small number of disks, say 8, are under heavy load re striping data across a hot spare or a replacement drive. During this time the performance of everything on the RAID pack is significantly impacted.

With Nutanix, the data is distributed across the entire cluster, which even with a 5 node cluster will be at least 20 SATA drives, but with all data being written to SSD then sequentially offloaded to SATA.

The impact of this process is much less than a RAID rebuild as all Nutanix controllers in the cluster participate and take a portion of the workload as a result the impact per disk, per controller ,per node and importantly for production VMs running in the cluster, is greatly reduced.

Essentially, the larger the cluster, the faster the cluster can repair itself, and the lower the impact on production workloads.

Now lets talk about a subsequent ESXi host failure, now we have two failed nodes, and three surviving nodes, and only one copy of data “A” , “B” and “C” as shown below.

Nutanix5NodeCluster2failures1copydata

Now the Nutanix “Curator” detects only one copy of data “A”, “B” and “C” exists and starts to replicate copies of “A”, “B” and “C” across the cluster. This results in the below which is a fully functional and redundant cluster, capable of surviving yet another failure as shown below.

Nutanix5NodeCluster2Failures

Even in this scenario, where two ESXi hosts are lost, the environment still has 60% of its storage controllers (and performance), as compared to a typical traditional storage product where the loss of just two (2) controllers can have your environment completely offline, and even if you only lost a single controller, you would only have 50% of the storage controllers (and performance) available.

I think this really highlights what VMware and players like Google, Facebook & Twitter have been saying for a long time, scaling out not up, and shared nothing architecture is the way of the future. The only question is who will be dominant in bringing this technology to the mass market, and I think you know who I have my money on.

Data Centre Migration Strategies – Part 2 – Lift and Shift

Continuing on from Data Centre Migration Strategies Part 1 – Overview, Part 2 focuses on the “Lift and Shift” method.

I’m sure your reading this and already thinking, “this is the least interesting migration strategy, tell me about vMSC and SRM!” and well, your right, BUT it is important to understand the pros and cons so if you are ever in a situation where you have to use this method (I have on numerous occasions) that the migration is successful.

So what are the pros and cons of this method.

Pros

1. No need to purchase equipment for the new data centre
2. The environment should perform as it did at the original data centre following relocation
3.The approach is simple from a technical perspective ie: No new products are required
4. Low direct cost (Note: Point 8 in Cons)
5. Achieves a Recovery Point Objective (RPO) of zero (0).

Cons

1. The entire environment needs to be fully shut-down
2. The outage for the environment starts from when the servers are shut-down, until completion of operational verification testing at the new datacenter. Note: This may take several days depending on the size of the environment.
3. This method is high risk as the ability to fail back to the original datacenter requires all equipment be physically relocated back. This means the Recovery Time Objective (RTO) cannot be low.
4. The Lift and shift method cannot be tested until at least a significant amount of equipment has been physical relocated
5. In the event of an issue during operational verification at the new data centre, a decision needs to be made to proceed and troubleshoot the issues, OR at what point to fail back.
6. Depending on your environment, a vendor (eg: Storage) may need to revalidate your environment
7. Your migration (and schedule) are heavily dependant on the logistical side of the relocation which may have many factors (eg: Traffic / Weather) which are outside your control which may lead to delays or failed migration.
8. Potentially high indirect cost eg: Downtime, Loss of Business , productivity etc

When to use this method?

1. When purchasing equipment for the new data centre is not possible
2. When extended outages to the environment are acceptable
3. When you have no other options

Recommendations when using “Lift and Shift”

1. Ensure you have accurate wiring and rack diagrams of your datacenter
2. Be prepared with your vendor support contact details on hand as it is common following relocation of equipment to have hardware failures
3. Ensure you have an accurate Operational Verification document which tests every part of your environment from Layer 1 (Physical) all the way to Layer 7 (Application)
4. Label EVERYTHING as you disconnect it at the original datacenter
5. Prior to starting your data centre  migration, discuss and agree on a timeline for the migration and at what point and under what situation do you initiate a fail back.
6. Migrate the minimum amount of physical equipment that is required to get your environment back on-line and do your Operational Verification, then on successful completion of your Operational Verification migrate the remaining equipment. This allows for faster fail-back in the event Operational Verification fails.

In Part 3, we discuss Data centre migrations using VMware Site Recovery Manager. (Coming soon)

 

Data Centre Migration Strategies – Part 1 – Overview

After a recent twitter discussion, I felt a Data Centre migration strategies would be a good blog series to help people understand what the options are, along with the Pros and Cons of each strategy.

This guide is not intended to be a step by step on how to set-up each of these solutions, but a guide to assist you making the best decision for your environment when considering a data centre migration.

So what’s are some of the options when migrating virtual machines from one data centre to another?

1. Lift and Shift

Summary: Shut-down your environment and Physically relocate all the required equipment to the new location.

2. VMware Site Recovery Manager (SRM)

Summary: Using SRM with either Storage Replication Adapters (SRAs) or vSphere Replication (VR) to perform both test and planned migration/s between the data centres.

3. vSphere Metro Storage Cluster (vMSC)

Summary: Using an existing vMSC or by setting up a new vMSC for the migration, vMotion virtual machines between the sites.

4. Stretched vSphere Cluster / Storage vMotion

Summary: Present your storage at one or both sites to ESXi hosts at one or both sites and use vMotion and Storage vMotion to move workloads between sites.

5. Backup & Restore

Summary: Take a full backup of your virtual machines, transport the backup data to a new data centre (physically or by data replication) and restore the backup onto the new environment.

6. Vendor Specific Solutions

Summary: There are countless vendor specific solutions which range from Storage layer, to Application layer and everything in between.

7. Data Replication and re-register VMs into vCenter (or ESXi) inventory

Summary: The poor man’s SRM solution. Setup data replication at the storage layer and manually or via scripts re-register VMs into the inventory of vCenter or ESXi for sites with no vCenter.

Each of the above topics will be discussed in detail over the coming weeks so stay tuned, and if you work for a vendor with a specific solution you would like featured please leave a comment and I will get back to you.