Problem Statement
What are the most suitable HA / host isolation settings for an environment that uses IBM SVC storage with FC connectivity over a dedicated, highly available Storage Area Network (SAN) fabric, where ESXi Management and Virtual Machine traffic run over a highly available data network?
Requirements
1. Ensure in the event of one or more hosts becoming isolated, the environment responds in an automated manner to recover VMs where possible
Assumptions
1. The Network is highly available (>99.999% availability)
2. The Storage is highly available (>99.999% availability)
3. vSphere 5.0 or later
4. ESXi hosts are connected to the network via two physically separate switches using two physical NICs
Constraints
1. FC (Block) based storage
Motivation
1. Meet/Exceed availability requirements
2. Minimize the chance of a false positive isolation event
Architectural Decision
Turn off the default isolation address by setting the below advanced setting
“das.usedefaultisolationaddress” = False
Configure three (3) isolation addresses by setting the below advanced settings
“das.isolationaddress1” = 192.168.1.1 (Core Router)
“das.isolationaddress2” = 192.168.1.2 (Core Switch 1)
“das.isolationaddress3” = 192.168.1.3 (Core Switch 2)
Configure Datastore Heartbeating with “Select any of the cluster datastores”
Configure Host Isolation Response to: “Shutdown”
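The advanced settings above can be assembled programmatically before they are applied to the cluster. The following is a minimal Python sketch, not an official VMware tool: the function name `build_ha_advanced_options` is hypothetical, and it only builds the key/value pairs listed in this decision. Applying them to the cluster (for example via the vSphere Client) is out of scope.

```python
from ipaddress import ip_address


def build_ha_advanced_options(isolation_addresses):
    """Return the HA advanced options from this decision as a dict.

    Hypothetical helper: it only assembles key/value pairs and does not
    communicate with vCenter.
    """
    if not isolation_addresses:
        raise ValueError("at least one isolation address is required")

    # Disable the default isolation address, as per the decision above.
    options = {"das.usedefaultisolationaddress": "false"}

    # Number the addresses from 1, matching das.isolationaddress1..3 above.
    for index, address in enumerate(isolation_addresses, start=1):
        ip_address(address)  # raises ValueError on a malformed address
        options[f"das.isolationaddress{index}"] = address

    return options


options = build_ha_advanced_options(["192.168.1.1", "192.168.1.2", "192.168.1.3"])
for key, value in options.items():
    print(f"{key} = {value}")
```

Validating each address up front is a small safeguard, since a mistyped isolation address would itself never respond and so would weaken the isolation check.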
Justification
1. When using FC storage, it is possible for the Management and Virtual Machine networks to be unavailable while the storage network is working perfectly. In this case, virtual machines may not be able to communicate with other servers but can continue reading from and writing to disk, so they will likely not be servicing customer workloads. Shutting the VMs down gracefully allows HA to restart them on hosts which are not isolated, which gives them a greater chance of resuming service than remaining on an isolated host.
2. Datastore heartbeating allows HA to confirm whether the host is “isolated” or “failed”. In either case, shutting down the VM allows HA to recover it on a surviving host.
3. As all storage is presented via Active/Active IBM SVC controllers, there is no benefit in specifying specific datastores to be used for heartbeating.
4. The selected isolation addresses were chosen because they are all highly available devices that are essential for network communication and cover the core routing and switching components in the network.
5. In an environment where the network is highly available, an isolation event is extremely unlikely. Where none of the three (3) isolation addresses can be contacted, it is therefore likely that either the network cannot be restored in a timely manner or the host has suffered multiple concurrent failures (e.g. multiple network cards). Performing a controlled shutdown helps ensure that when the network is recovered, the VMs are brought back up in a consistent state, or, where the isolation impacts only a subset of ESXi hosts in the cluster, that the VMs can be recovered by HA and resume normal operations.
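The reasoning in points 2 and 5 can be illustrated with a simplified model. This Python sketch is an approximation for discussion purposes, not VMware's implementation: the function names and state labels are hypothetical, and real HA behaviour also involves election traffic, network partitions and timers that are ignored here.

```python
def host_declares_isolation(reachable):
    """A host treats itself as isolated only if NO isolation address responds.

    `reachable` maps each isolation address to True (responds) or False.
    """
    return not any(reachable.values())


def classify_host(network_heartbeat, datastore_heartbeat):
    """Simplified view, from the rest of the cluster, of a single host."""
    if network_heartbeat:
        return "live"
    if datastore_heartbeat:
        # Still updating shared storage, so the host is running but
        # unreachable over the network (partitions are not modelled).
        return "isolated"
    return "failed"


# One core switch still answers, so the host does not declare isolation.
pings = {"192.168.1.1": False, "192.168.1.2": False, "192.168.1.3": True}
print(host_declares_isolation(pings))

# No network heartbeat but the datastore heartbeat continues.
print(classify_host(network_heartbeat=False, datastore_heartbeat=True))
```

Requiring every isolation address to fail before declaring isolation is what keeps the chance of a false positive low when three independent core devices are used.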
Alternatives
1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address
Implications
1. In the event the host cannot reach any of the isolation addresses, virtual machines will be shut down.
2. Using “Shutdown” as opposed to “Power Off” ensures a graceful shutdown of the guest operating system. However, this will delay the HA restart of the VM for up to 5 minutes (300 seconds) if VMware Tools is unable to perform a controlled shutdown, in which case a “Power Off” will be executed after 300 seconds.
3. In the unlikely event of network instability, VMs may be shut down prematurely.
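The timing described in implication 2 can be sketched as follows. This is an illustrative Python model only: the function name and its parameters are hypothetical, and the 300-second limit is taken from the text above rather than read from any live configuration.

```python
SHUTDOWN_TIMEOUT_SECONDS = 300  # limit quoted in implication 2


def isolation_response_outcome(guest_shutdown_seconds,
                               timeout=SHUTDOWN_TIMEOUT_SECONDS):
    """Return (action, delay_seconds) for one VM on an isolated host.

    guest_shutdown_seconds is the time the guest needs for a clean
    shutdown, or None if VMware Tools cannot shut the guest down at all.
    """
    if guest_shutdown_seconds is not None and guest_shutdown_seconds <= timeout:
        # The guest shut down cleanly within the limit.
        return ("shutdown", guest_shutdown_seconds)
    # Graceful shutdown did not complete in time: fall back to power off.
    return ("power_off", timeout)


print(isolation_response_outcome(45))    # clean shutdown within the limit
print(isolation_response_outcome(None))  # Tools unavailable, power off at the limit
```

The delay returned is the time before HA can begin restarting the VM elsewhere, which is the trade-off accepted in exchange for a graceful guest shutdown.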