IT Infrastructure Business Continuity & Disaster Recovery (BC/DR) – Corona Virus edition

Posted on March 16, 2020 by Josh Odgers

Back in 2014, I wrote about Hardware support contracts & why 24×7 4 hour onsite should no longer be required. For those of you who haven’t read the article, I recommend doing so prior to reading this post.

In short, the post covered the typical old-school requirement for expensive 24/7, 2 or 4-hour maintenance contracts and how these become all but redundant when IT solutions are designed with appropriate levels of resiliency and with self-healing capabilities that meet the business continuity requirements.

Some of the key points I made regarding hardware maintenance contracts included:

a) Vendors failing to meet SLA for onsite support.

b) Vendors failing to have the required parts available within the SLA.

c) Replacement HW being refurbished (common practice) and being faulty.

d) The more proprietary the HW, the more likely replacement parts will not be available in a timely manner.

All of these are applicable to all vendors and can significantly impact the ability to get the IT infrastructure back online or back to a resilient state where subsequent failures may be tolerated without downtime or data loss.

I thought with the current Coronavirus pandemic, it’s important to revisit this topic and see what we can do to improve the resiliency of our critical IT infrastructure and ensure business continuity no matter what the situation.

Let’s start with “Vendors failing to meet SLA for onsite support.”

At the time of writing, companies the world over are asking employees to work from home and operate on skeleton staff. This will no doubt impact vendor abilities to provide their typical levels of support.

Governments are also encouraging social distancing, asking that people isolate themselves and avoid unnecessary travel.

We would be foolish to assume this won’t impact vendor abilities to provide support, especially hardware support.

What about Vendors failing to have the required parts available within the SLA?

Currently I’m seeing significantly fewer flights operating, e.g. from the USA to Europe, which will no doubt delay parts shipments and make it harder to meet the target service level agreements.

Regarding vendors using potentially faulty refurbished hardware (a common practice), this risk in itself isn’t increased, but if it does occur, shipment of alternative/new parts is likely to be delayed.

Lastly, infrastructure leveraging proprietary HW makes it more likely that replacement parts will not be available in a timely manner.

What are some of the options Enterprise Architects can offer their customers/employers when it comes to delivering highly resilient infrastructure to meet/exceed business continuity requirements?

Let’s start with the assumption that replacement hardware isn’t available for one week, which is likely much more realistic than same-day replacement for the majority of customers considering the current pandemic.

Business Continuity Requirement #1: Infrastructure must be able to tolerate at least one component failure and have the ability to self heal back to a resilient state where a subsequent failure can be tolerated.

By component failure, I’m talking about things like:

a) HDD/SSDs

b) Physical server/node

c) Networking device such as a switch

d) Storage controller (SAN/NAS controllers, or in the case of HCI, a node)

HDDs/SSDs have been traditionally protected by using RAID and Hot Spares, although this is becoming less common due to RAID’s inherent limitations and high impact of failure.

For physical servers/nodes, products like VMware vSphere, Microsoft Hyper-V and Nutanix AHV all have “High Availability” functions which allow virtual machines to recover onto other physical servers in a cluster in the event of a physical server failure.

For networking, typically leaf/spine topologies provide a sufficient level of protection with a minimum of dual connections to all devices. Depending on the criticality of the environment, quad connections may be considered/required.

Lastly, with Storage Controllers, traditional dual controller SAN/NAS have a serious constraint when it comes to resiliency in that they require the HW replacement to restore resiliency. This is one reason why Hyper-Converged Infrastructure (a.k.a. HCI) has become so popular: some HCI products have the ability to tolerate multiple storage controller failures and continue to function and self-heal thanks to their distributed/clustered architecture.

So with these things in mind, how do we meet our Business Continuity Requirement?

Disclaimer: I work for Nutanix, a company that provides Hyper-Converged Infrastructure (HCI), so I’ll be using this technology as my example of how resilient infrastructure can be designed. With that said the article and the key points I highlight are conceptual and can be applied to any environment regardless of vendor.

For example, Nutanix uses a Scale Out Shared Nothing Architecture to deliver highly resilient and self healing capabilities. In this example, Nutanix has a small cluster of just 5 nodes. The post shows the environment suffering a physical server failure, and then self healing both the CPU/RAM and Storage layers back to a fully resilient state and then tolerating a further physical server failure.

After the second physical server failure, it’s critical to note the Nutanix environment has self healed back to a fully resilient state and has the ability to tolerate another physical server failure.

In fact the environment has lost 40% of its infrastructure and Nutanix still maintains data integrity & resiliency. If a third physical server failed, the environment would continue to function maintaining data integrity, though it may not be able to tolerate a subsequent disk failure without data becoming unavailable.

So in this simple example of a small 5-node Nutanix environment, up to 60% of the physical servers can be lost and the business would continue to function.
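The arithmetic behind the 5-node example can be sketched in a few lines of Python. This is a deliberately simplified capacity model, not how Nutanix (or any vendor) actually decides whether a rebuild is possible: the `Cluster` class, its field names, the uniform node capacity and the 12 TB of data are all assumptions made for illustration, and real platforms also consider metadata quorum, CPU/RAM headroom and free-space reserves.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical, simplified model of a scale-out cluster."""
    nodes: int                   # healthy nodes currently in the cluster
    node_capacity_tb: float      # raw capacity per node (assumed uniform)
    logical_data_tb: float       # data stored, before replication
    replication_factor: int = 2  # number of copies kept of each piece of data

    def can_self_heal(self) -> bool:
        """True if the surviving nodes can hold every replica again."""
        raw_needed = self.logical_data_tb * self.replication_factor
        raw_available = self.nodes * self.node_capacity_tb
        enough_nodes = self.nodes >= self.replication_factor
        return enough_nodes and raw_available >= raw_needed

    def fail_node(self) -> "Cluster":
        """Return the cluster as it looks after losing one node."""
        return Cluster(self.nodes - 1, self.node_capacity_tb,
                       self.logical_data_tb, self.replication_factor)


def failures_survived_with_full_resiliency(cluster: Cluster) -> int:
    """Count sequential node failures after which data can be fully re-protected."""
    survived = 0
    current = cluster.fail_node()
    while current.nodes > 0 and current.can_self_heal():
        survived += 1
        current = current.fail_node()
    return survived


# Illustrative numbers only: 5 nodes of 10 TB raw each, 12 TB of data, two copies.
example = Cluster(nodes=5, node_capacity_tb=10.0, logical_data_tb=12.0)
healed = failures_survived_with_full_resiliency(example)
print(f"Self-heals after {healed} failures ({healed / example.nodes:.0%} of nodes)")
```

With these assumed figures the model self-heals after two sequential node failures (40% of the nodes) and runs out of space to re-protect data after the third. Working through the same calculation with your own node count, capacity and utilisation is a quick way to see how much free capacity a design needs to keep in reserve.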

With all these component failures, it’s important to note the Nutanix platform self healing was completed without any human intervention.

For those who want more technical detail, check out my post which shows Nutanix Node (server) failure rebuild performance.

From a business perspective, a Nutanix environment can be designed so that the infrastructure can self heal from a node failure in minutes, not hours or days. The platform’s ability to self heal in a timely manner is critical to reduce the risk of a subsequent failure causing downtime or data loss.
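To get a feel for why a distributed rebuild can finish in minutes, here is a rough back-of-the-envelope estimate in Python. It assumes every surviving node contributes equally to the rebuild, which is a simplification; the function name, parameters and the throughput figures are hypothetical and are not measurements from any product.

```python
def rebuild_minutes(data_on_failed_node_tb: float,
                    surviving_nodes: int,
                    per_node_rebuild_mb_s: float) -> float:
    """Estimate minutes to re-protect the data that lived on a failed node.

    Assumes the rebuild is shared evenly by all surviving nodes (illustrative only).
    """
    if surviving_nodes < 1 or per_node_rebuild_mb_s <= 0:
        raise ValueError("need at least one surviving node and a positive rate")
    total_mb = data_on_failed_node_tb * 1_000_000  # decimal TB -> MB
    aggregate_mb_s = surviving_nodes * per_node_rebuild_mb_s
    return total_mb / aggregate_mb_s / 60


# Illustrative inputs: 4 TB to re-protect, 500 MB/s of rebuild throughput per node.
for nodes in (4, 8, 16):
    print(f"{nodes} surviving nodes: ~{rebuild_minutes(4, nodes, 500):.0f} minutes")
```

The point of the sketch is the shape of the result rather than the numbers: when rebuild work is spread across the cluster, the window of exposure to a subsequent failure shrinks as nodes are added, whereas a design that waits for a replacement part has a window measured by the shipping time.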

Key Point: The ability for infrastructure to self heal back to a fully resilient state following one or more failures WITHOUT human intervention or hardware replacement should be a firm requirement for any new or upgraded infrastructure.

So the good news for Nutanix customers is during this pandemic or future events, assuming the infrastructure has been designed to tolerate one or more failures and self heal, the potential (if not likely) delay in hardware replacements is unlikely to impact business continuity.

For those of you who are concerned after reading this that your infrastructure may not provide the business continuity you require, I recommend you get in touch with the vendor/s who supplied the infrastructure, then work through and document the failure scenarios, the impact each has on the environment, and how the solution is recovered back to a fully resilient state.

Worst case, you’ll identify gaps which will need attention, but think of this as a good thing because this process may identify issues which you can proactively resolve.

Pro Tip: Where possible, choose a standard platform for all workloads.

As discussed in “Things to consider when choosing infrastructure”, choosing a standard platform to support all workloads can have major advantages such as:

Reduced silos
Increased infrastructure utilisation (due to reduced fragmentation of resources)
Reduced operational risk/complexity (due to fewer components)
Reduced OPEX
Reduced CAPEX

The article summarises by stating:

“if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you’re doing yourself (or your customer) a dis-service by not considering using a standard platform for your workloads.”

What are some of the key factors to improve business continuity?

Keep it simple (stupid!) and avoid silos of bespoke infrastructure where possible.
Design BEFORE purchasing hardware.
Document BUSINESS requirements AND technical requirements.
Map the technical solution back to the business requirements, i.e. how does each design decision help achieve the business objective/s?
Document risks and how the solution mitigates & responds to the risks.
Perform operational verification, i.e. validate the solution works as designed/assumed, and perform this testing after initial implementation and after maintenance/change windows.

Considerations for CIOs / IT Management:

1. Cost of performance degradation, such as reduced sales transactions per minute and/or reduced employee productivity/morale
2. Cost of downtime, such as a total outage of IT systems, including lost revenue and impact to your brand
3. Cost of increased resiliency compared to points 1 & 2 (i.e. it’s often much cheaper to implement a more resilient solution than to suffer even a single outage annually)
4. How employees can work from home and continue to be productive
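The comparison between the cost of outages and the cost of increased resiliency can be roughed out with a short calculation. The sketch below is only an illustration: the function names are hypothetical and every dollar figure and outage frequency is a made-up input that you would replace with your own business numbers.

```python
def expected_annual_outage_cost(outages_per_year: float,
                                hours_per_outage: float,
                                revenue_loss_per_hour: float,
                                productivity_loss_per_hour: float) -> float:
    """Expected yearly cost of outages: frequency x duration x hourly impact."""
    hourly_impact = revenue_loss_per_hour + productivity_loss_per_hour
    return outages_per_year * hours_per_outage * hourly_impact


def resiliency_pays_off(extra_annual_cost: float,
                        outage_cost_without: float,
                        outage_cost_with: float) -> bool:
    """True if the added resiliency costs less than the outage cost it avoids."""
    return extra_annual_cost < (outage_cost_without - outage_cost_with)


# Illustrative inputs only.
baseline = expected_annual_outage_cost(1, 8, 20_000, 5_000)     # one 8-hour outage a year
resilient = expected_annual_outage_cost(0.1, 8, 20_000, 5_000)  # one such outage a decade
print(f"Baseline: {baseline:,.0f}  Resilient: {resilient:,.0f}")
print("Worth it:", resiliency_pays_off(60_000, baseline, resilient))
```

Brand damage is hard to express as an hourly figure, so treat the result as a lower bound on the cost of downtime rather than a complete business case.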

Here are a few things to ask of your architect/s when designing infrastructure:

Document failure scenarios and the impact to the infrastructure.
Document how the environment can be upgraded to provide higher levels of resiliency.
Document the Recovery Time (RTO) and Recovery Point Objectives (RPO) and how the environment meets/exceeds these.
Document under what circumstances the environment may/will NOT meet the desired RPO/RTOs.
Design & Document a “Scalable and repeatable model” which allows the environment to be scaled without major re-design or infrastructure replacement to cater for unforeseen workloads (e.g. a sudden increase in employees working from home).
Avoid creating unnecessary silos of dissimilar infrastructure
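One lightweight way to act on the tips above is to record each failure scenario alongside its expected recovery time and data loss, then check the list against the agreed RTO/RPO. The following Python sketch shows the idea; the `FailureScenario` structure, the scenario names and all the figures are hypothetical examples (the one-week entry reuses this article's assumption about delayed hardware replacement).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    """One documented failure scenario and its expected impact (illustrative)."""
    name: str
    expected_recovery_minutes: float   # time until service is restored
    expected_data_loss_minutes: float  # worth of data that could be lost


def scenarios_missing_objectives(scenarios, rto_minutes, rpo_minutes):
    """Return the names of scenarios that would NOT meet the RTO or the RPO."""
    return [
        s.name
        for s in scenarios
        if s.expected_recovery_minutes > rto_minutes
        or s.expected_data_loss_minutes > rpo_minutes
    ]


ONE_WEEK_MINUTES = 7 * 24 * 60

# Hypothetical entries; replace with the scenarios documented for your environment.
documented = [
    FailureScenario("Single disk failure", 0, 0),
    FailureScenario("Single node failure (HA restart)", 5, 0),
    FailureScenario("Failure that needs a replacement part to recover",
                    ONE_WEEK_MINUTES, 0),
]

print(scenarios_missing_objectives(documented, rto_minutes=60, rpo_minutes=15))
```

Keeping this kind of record in a structured form makes it easy to re-run the check whenever the RTO/RPO, the design, or the assumptions about parts availability change.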

5 easy steps to follow when designing a solution.
1. Document problem statement/s
2. Gather & document requirements, constraints, risks, assumptions
3. Document design decisions/alternatives
4. Provide mitigations to risks
5. Create a scalable/repeatable model
#NPX #VCDX #vExpert
— Josh Odgers (@josh_odgers) March 18, 2018


Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (2 of 2)

Posted on October 26, 2014 by Josh Odgers

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing by default, what is the most suitable TPS configuration for a Virtual Desktop environment?

Assumptions

1. TPS is disabled by default
2. Storage is expensive
3. Two Socket ESXi Hosts have been chosen to align with a scale out methodology.
4. Average VDI user is Task Worker with 1vCPU and 2GB Ram.
5. Memory is the first compute level constraint.
6. HA Admission Control policy used is “Percentage of Cluster Resources reserved for HA”
7. vSphere 5.5 or earlier

Requirements

1. VDI environment costs must be minimized

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Enable TPS and disable Large Memory pages

Justification

1. Disabling Large pages is essential to maximizing the benefits of TPS
2. Not disabling large pages would likely result in minimal TPS savings
3. With Kiosk and Task worker VDI profiles, the percentage of memory which is likely to be shared is higher than for Power users.
4. Existing shared storage has plenty of spare Tier 1 capacity for vSwap files

Implications

1. Sufficient capacity for VM swap files must be catered for.
2. VDI & Storage performance may be impacted significantly in the event of memory contention.
3. Decreased memory costs may result in increased storage costs.
4. After patching, perform operational verification to confirm that non-default settings have not been reverted by the patching of ESXi.
5. Additional CPU overhead on ESXi from enabling TPS.
6. HA admission control (when configured to Percentage of Cluster resources reserved for HA) will only calculate fail-over capacity based on 0MB + VM overhead for each VM, which can lead to significantly degraded performance in a HA event.
7. Higher core count (and higher cost) CPUs may be desired to drive overcommitment ratios as RAM will be less likely to be a point of contention.
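The HA admission control implication is easier to see with numbers. The Python sketch below compares what a percentage-based policy would count as required memory when VMs have no reservation (0MB + overhead) versus a 100% reservation. It is a simplified memory-only model; the function names are hypothetical, and the host size, VM count and 50MB per-VM overhead are illustrative assumptions rather than values taken from VMware documentation.

```python
def ha_counted_memory_mb(vm_count: int, reservation_mb: int, overhead_mb: int) -> int:
    """Memory that admission control counts: reservation plus overhead per VM."""
    return vm_count * (reservation_mb + overhead_mb)


def failover_capacity_pct(cluster_memory_mb: int, counted_mb: int) -> float:
    """Share of cluster memory that admission control still considers free."""
    return 100 * (cluster_memory_mb - counted_mb) / cluster_memory_mb


# Illustrative cluster: 4 hosts x 256GB, 400 task-worker desktops with 2GB RAM each.
CLUSTER_MB = 4 * 256 * 1024
VMS, VM_RAM_MB, OVERHEAD_MB = 400, 2048, 50  # overhead value is an assumption

no_reservation = ha_counted_memory_mb(VMS, 0, OVERHEAD_MB)
full_reservation = ha_counted_memory_mb(VMS, VM_RAM_MB, OVERHEAD_MB)

print(f"No reservation:   {failover_capacity_pct(CLUSTER_MB, no_reservation):.1f}% seen as free")
print(f"100% reservation: {failover_capacity_pct(CLUSTER_MB, full_reservation):.1f}% seen as free")
print(f"Configured VM RAM is {100 * VMS * VM_RAM_MB / CLUSTER_MB:.0f}% of cluster memory")
```

In this example the policy sees almost the whole cluster as free when no reservations are set, even though the desktops are configured with roughly three quarters of the cluster's memory. That gap is why a host failure can leave the surviving hosts heavily overcommitted, and why the alternatives below use memory reservations.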

Alternatives

1. Use 100% memory reservation and leave TPS disabled (default)
2. Use 50% memory reservation and Enable TPS and disable large pages

Related Articles:

1. The Impact of Transparent Page Sharing (TPS) being disabled by default @josh_odgers (VCDX#90)

2. Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (1 of 2)

3. Future direction of disabling TPS by default and its impact on capacity planning –@FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl(VCDX#104)

Data Locality & Read Cache – Why it’s critical for high performance Horizon View environments (Part 1)

Posted on September 24, 2013 by Josh Odgers

Let’s start by describing a common performance issue with Horizon View (formerly VMware View) environments using Linked Clones.

1. Read and Write I/O for all Virtual Desktops is serviced by a central storage controller/s over a storage network (FC, FCoE, IP etc.)
2. High performance caching is on the storage controller, which is not local to the ESXi host
3. Performance is 100% dependent on the Storage Area Network and Storage Controllers, so where latency or contention exists, performance will suffer.

The End Result is

1. Boot and/or login storm performance may be poor
2. Read I/O heavy tasks such as AV scans are a high impact to the ESXi host, Storage Network and Storage Controllers, which can have a high impact on other workloads using the same shared storage (virtual and/or physical)
3. Steady state performance (such as application load times) can be poor

One thing VMware has done to improve this situation is create Content Based Read Cache (CBRC a.k.a View Storage Accelerator).

CBRC has been shown in many tests to greatly improve the performance especially during boot storms, login storms and antivirus scans.

Andre Leibovici (@andreleibovici) wrote a great article on his blog “myvirtualcloud.net” showing the benefits of CBRC. In summary, it showed that I/O to the underlying storage was reduced and performance improved in all three of the above-mentioned scenarios.
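To illustrate why a content-based read cache helps so much during boot storms, here is a minimal Python sketch of the general technique: blocks are cached by a hash of their content, so identical blocks read by many desktops are fetched from storage only once. This is not VMware's implementation; the class, the SHA-1 digest choice, the LRU eviction and the sizes are all assumptions for illustration.

```python
import hashlib
from collections import OrderedDict


def digest_of(data: bytes) -> str:
    """Content fingerprint used as the cache key (SHA-1 chosen for illustration)."""
    return hashlib.sha1(data).hexdigest()


class ContentBasedReadCache:
    """Tiny LRU read cache keyed by block content digest (illustrative only)."""

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()  # digest -> block data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, digest: str, fetch_from_storage) -> bytes:
        """Serve a block from cache, or fetch it from storage and cache it."""
        if digest in self._blocks:
            self._blocks.move_to_end(digest)  # mark as most recently used
            self.hits += 1
            return self._blocks[digest]
        self.misses += 1
        data = fetch_from_storage()
        self._blocks[digest] = data
        if len(self._blocks) > self.max_blocks:
            self._blocks.popitem(last=False)  # evict least recently used block
        return data


# Simulated boot storm: 100 desktops each read the same 50 operating system blocks.
os_blocks = [f"os-block-{i}".encode() for i in range(50)]
digests = [digest_of(block) for block in os_blocks]

cache = ContentBasedReadCache(max_blocks=64)
for _desktop in range(100):
    for digest, block in zip(digests, os_blocks):
        cache.read(digest, lambda block=block: block)

print(f"Reads served from cache: {cache.hits}, reads sent to storage: {cache.misses}")
```

In this simulation only the first desktop's reads reach the storage layer; every later desktop is served from memory. The cache size matters, though: if the working set is larger than the cache, blocks are evicted and reads fall through to storage again.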

While CBRC does improve the situation, it does have some limitations.

1. Restricted to 2048MB (2GB)
2. The Bulk of the user data (Read I/O) and ALL Write I/O is still serviced remotely on the Storage Array over the Storage Area Network which still places a serious dependency on the Storage Network and (limited number of) Storage Controllers.
3. Virtual machine host caching is applied only when the virtual machine is powered off. Configuring host caching for a powered-on virtual machine requires virtual machine shutdown to apply the configuration. In the case of linked clone virtual machines, caching cannot be enabled when any of the virtual machines based on the same shared base disk are powered on.
4. Enabling host caching will create additional disks for each boot VMDK and snapshot VMDK, which results in increased storage utilization.
5. If the administrator changes the advanced configuration parameters, some changes might require the vSphere CBRC module to be unloaded and reloaded for the changes to take effect.
6. Host caching cannot be configured for non-vCenter managed pools, or terminal server pools.

Note: The source for some of the above limitations is the VMware View Storage Accelerator – View 5.1 white paper.

In a Nutanix environment, there are a number of features which go further to address these issues. The first is called “Extent Cache”, which is a READ cache in the DRAM of each Controller VM (CVM). The “Extent Cache” is 3072MB by default but can be increased to whatever size suits your environment.

The below is a visual representation of the Nutanix Extent Cache.

Some of the benefits of Nutanix Extent Cache are

1. Can be sized to suit your requirements (Not limited to 2GB)
2. Does not require the use of CBRC
3. Reduced overhead on the Storage Network (IP Network) as more read I/O can be cached locally
4. Due to Nutanix DFS “Data locality”, data that is not stored in Extent Cache is generally accessed locally via SSD. This further reduces the overhead on the Storage Network & dramatically reduces the impact on other controllers within the Nutanix cluster. See my post Data Locality & Why is important for vSphere DRS clusters for more information about data locality
5. During boot storms, login storms and antivirus scans more data can be served from Cache (Extent Cache) and less read I/O is forced to be served by SSD (or local SATA drives if the data is “Cold”). This not only improves Read performance but makes more I/O available for Write operations which are generally >=65% in VDI environments
6. Works with any Virtual Machine, not just VDI and is hypervisor agnostic.
7. No need to configure a View Desktop Pool to use “Host Caching” as the “Extent Cache” performs this function (and more) at the Nutanix layer automatically
8. No need to configure “Regenerate Cache” or “Blackout Times” which is required for CBRC
9. No additional storage is used by Nutanix Extent Cache
10. Changes to the Nutanix Extent Cache can be done without disruption to the Virtual machines

Note: Extent cache is not limited to desktop workloads, it also works with any type of virtual machine and is operating system agnostic.

In Part 2, the Nutanix Dynamic Shadows feature will be discussed to show how Nutanix ensures data locality for 100% read-only data such as Linked Clone Replicas.

A special Thank you to Jason Langone VCDX#54 (@langonej) for reviewing this post and to Tabrez Memon one of the brilliant Engineers at Nutanix who has worked on features discussed in this post and provided valuable input into this series.

CloudXC

By Josh Odgers – VMware Certified Design Expert (VCDX) #90
