Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

In recent weeks, I have seen numerous RFQs which have the requirement for 24×7 2 or 4hr onsite HW replacement, and while this is not uncommon I’ve been thinking why is this the case?

Over my I.T career spanning coming up on 15 years, in the majority of cases, I have strongly recommended in my designs and Bill of Materials (BoMs) that customers buy 24×7 4 hours onsite hardware maintenance contracts for equipment such as Compute, Storage Arrays , Storage Area Networking and IP network devices.

I have never found it difficult to justify this recommendation, because traditionally if a component in the datacenter fails, such as a Storage Controller, this generally has a high impact on the customers business and could cost tens or hundreds of thousands of dollars or even millions of dollars in revenue depending on the size of the customer.

Not only is loosing a Storage controller general a high impact, it is also a high risk as the environment may no longer have redundancy and a subsequent failure could (and would likely) result in a full outage.

So in this example, a typical storage solution has a Storage Controller failure resulting in degraded performance (due to loosing 50% of the controllers) and high impact/risk to the customer, a customer purchasing 24×7 4 Hour, or even 24×7 2hr support contract makes perfect sense! The question is, why choose HW (or a solution) which puts you at high risk after a single component failure in the first place?

With technology fast changing and over the last year or so, I’ve been involved in many customer meetings where I am asked what I recommend in terms of hardware maintenance contracts (for Nutanix customers).

Normally this question/conversation happens after the discussion about the technology, where I explain various failure scenarios and how resilient a Nutanix cluster is.

My recommendation goes something like this.

If you architect your solution for your desired level of availability (e.g.: N+2) there is no need to buy 24×7 4hr hardware maintenance contract, the default Next Business Day option is perfectly fine.

Justification:

1. In the event of even an entire node failure, the Nutanix cluster will have automatically self healed the configured resiliency factor (2 or 3) well before even a 2hr support contract can provide a technician to be onsite, diagnose the issue and replace hardware.

2. Assuming the HW is replaced on the 2hr mark (not typical in my experience), AND assuming Nutanix was not automatically self healing prior to the drive/node replacement, the replacement drive or node would then START the process of self healing. So the actual time to recovery would be greater than 2hrs. In the case of Nutanix, self heal begins almost immediately.

3. If a cluster is sized for the desired level of availability based on business requirements, say N+2, a Node can fail, Nutanix will automatically self heal and then tolerate a subsequent failure with the ability to full self heal the configured resiliency factor (2 or 3) again.

4. If a cluster is sized only to customer requirement of only N+1, a Node can fail, Nutanix will automatically and fully self heal. Then in the unlikely (but possible) event of a subsequent failure (i.e.: A 2nd node failure before the next business day warranty replaces the failed HW), the Nutanix cluster will still continue to operate.

5. The performance impact of a node failure in a Nutanix environment is N-1, so in a worst case scenario (3 node cluster) the impact is 33%, compared to a 2 controller SAN/NAS where the impact would be 50%. In a 4 node cluster the impact is only 25% and for customer with say 8 nodes only 12.5%. The bigger the cluster the lower the impact. Nutanix recommends N+1 up to 16 nodes, and N+2 up to 32 nodes. Beyond 32 nodes higher levels of availability may be desired based on customer requirements.

The risk and impact of the failure scenario/s is key, in the case of Nutanix, because of the self healing capability, and the fact all controllers and SSDs/HDDs in the cluster participate in the self heal, it can be done very quickly and with low impact. So the impact of the failure is low (N-1) and the recovery is done quickly, so the risk to the business is low, therefore dramatically reducing (and in my opinion potentially removing) the requirement for a 24×7 2 or 4hr support contract for Nutanix customers.

In Summary:

1. The decision on what hardware maintenance contract is appropriate is a business level decision which should be based in part on a comprehensive risk assessment done by an experienced enterprise architect, intimately familiar with all the technology being used.

2. If the recommendation from the trusted experienced enterprise architect is that the risk of HW failure causing high impact or outage to the business is so high that purchasing a 4hr or 2hr onsite HW replacement is required, my advise would be to reconsider if the proposed “solution” meets the business requirements. Only if you are constrained to that solution, purchase a 24×7 2 or 4hr support contract.

3. Being heavily dependant on Hardware being replaced to restore resiliency / performance for a solution, is in itself a high risk to the business.

AND

4. In my experience, it is not uncommon to have problems getting onsite support or hardware replacement regardless of the support contract / SLA. Sometimes this is outside a vendors control, but most vendors will experience one or more of these issues which I have personally experienced on numerous occasions in previous roles:

a) Vendors failing to meet SLA for onsite support.
b) Vendors failing to have the required parts available within the SLA.
c) Replacement HW being refurbished (common practice) and being faulty.
d) The more propitiatory the HW, the more likely replacement parts will not be available in a timely manner.

Note: Support contracts don’t promise a resolution by the 2hr / 4hr contract, they simply promise somebody will be onsite and in some cases this is only after you have gone through troubleshooting with the vendor on the phone, sent logs for analysis and so on. So the reality is, the 2hr or 4hr part doesn’t hold much value.

If you have accepted the solution being sold to you OR your an architect recommending a solution which is enterprise grade and highly resilient with self healing capabilities, then consider why you need a 24×7 2hr or 4hr hardware maintenance contract if the solution is architected for the required availability level (i.e.: N+1 / N+2 etc)

So with your next infrastructure purchase (or when making your recommendations if you’re an architect), carefully consider what solution your investing in (or proposing), and if you feel an aggressive 2hr/4hr HW support contract is required, I would recommend revisiting the requirements as you may well be buying (or recommending) something that isn’t resilient enough to meet the requirements.

Food for thought.

2 thoughts on “Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

  1. Josh,

    Great article! I definitely agree on your summary #4. You’re often lucky if you get a callback in 4 hours, much less a part. I’d put some of that on the buyer to research the effectiveness of the support contracts from the vendor. This is, often, well known by the user community, making it a simple cost/benefit analysis to see if the contract is worth the money spent, but it can be extremely frustrating when a stolid vendor suddenly starts failing you. All you can do is hope that future performance falls in line with past performance.

    Summary #2 I think is slightly more complex. I have two thoughts here:

    A) The business need will often boil down to “keep things running *affordably*”, where affordable is $repair_cost < $outage_cost with some variation in how each side is calculated. With that in mind, the architect and purchasers should determine the cost of N, N+1, N+2 … N+X vs N + support, N+1 + support, etc. The delta between NBD and 4hr response can be cheaper than the delta between N+1 and N+2 for various components.

    B) There's a lot of push at every level to "standardize", which can lead to a desire to have the same response level on all contracts. You can also mix and match, where compute may be N+2/NBD and storage processors are N+1/4hr. The site runbooks need to document this but it shouldn't be too complex for competent employees to handle. Standardizing your hardware contracts just to standardize doesn't end up meeting any business need.

    Both A and B presuppose that you do have competent employees and effective monitoring to not only detect failures, but notify an external system (typically an actual human being at some level) in a timely fashion. Hardware has gotten a lot better about opening cases for you when, say, a drive failure is detected, but there are still a lot of complex situations that require an external system to detect. Maybe Summary #5: Monitor your infrastructure so that whatever support contract you have is used efficiently to meet your business needs!

  2. Pingback: Nutanix Platform Link-O-Rama | vcdx133.com

Leave a Reply