Example Architectural Decision – Network Failover Detection Policy

Problem Statement

What is the most suitable network failover detection policy for the vSwitch or dvSwitch NIC team(s) in an environment that uses IP storage and has only two physical NICs per vSwitch or dvSwitch?

Assumptions

1. vSphere 5.0 or greater
2. Storage is presented to the ESXi hosts as NFS via multi-switch link aggregation
3. A maximum of 2 physical NICs exist per dvSwitch
4. Physical Switches support “Link state tracking”

Motivation

1. Ensure a reliable network failover detection solution
2. Ensure Multi switch link aggregation can be used for IP storage

Architectural Decision

Enable “Link state tracking” on the physical switches and use “Link status only” as the network failover detection policy on the vSwitch/dvSwitch NIC teams.
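Link state tracking itself is a physical switch feature, but the vSphere side of this decision can also be applied programmatically. Below is a rough pyVmomi (VMware Python SDK) sketch that sets a standard vSwitch to rely on link status rather than beacon probing; the vCenter address, credentials, host name and vSwitch name are placeholders, and a dvSwitch would instead be configured via its uplink teaming policy.

```python
# Rough sketch only: force "Link status only" (checkBeacon=False) on a
# standard vSwitch. All names and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the ESXi host and its networking configuration manager
host = content.searchIndex.FindByDnsName(None, "esxi01.lab.local", vmSearch=False)
net_sys = host.configManager.networkSystem

# Reuse the vSwitch's existing spec and only change the failure criteria
vswitch = next(v for v in net_sys.networkInfo.vswitch if v.name == "vSwitch0")
spec = vswitch.spec
spec.policy.nicTeaming.failureCriteria = vim.host.NetworkPolicy.NicFailureCriteria(
    checkBeacon=False)  # False = link status only, no beacon probing

net_sys.UpdateVirtualSwitch(vswitchName="vSwitch0", spec=spec)
Disconnect(si)
```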

Justification

1. To work properly, Beacon Probing requires at least 3 NICs for “triangulation”; with only 2 NICs, a failed link cannot be reliably determined.
2. “Link state tracking” can be enabled on the physical switches to report upstream network failures where an “edge”/“core” network topology is used, preventing the link status from showing as up when traffic cannot reach its destination due to an upstream failure.
3. Beacon Probing is not compatible with the “Route based on IP hash” network load balancing option; using it would therefore prevent a single VMkernel port from using multiple interfaces for IP storage traffic.

Implications

1. Link state tracking needs to be supported and enabled on the physical switches

Alternatives

1. Use “Beacon Probing”

Example Architectural Decision – Host Isolation Response for IP Storage

Problem Statement

What is the most suitable HA host isolation response when using IP-based storage (in this case, a Netapp HA pair in 7-Mode), where the IP storage runs over physically separate network cards and switches from ESXi management?

Assumptions

1. vSphere 5.0 or greater (to enable the use of Datastore Heartbeating)
2. vFiler1 & vFiler2 reside on different physical Netapp Controllers (within the same HA Pair in 7-mode)
3. Virtual machine guest operating systems are configured with an I/O timeout of 190 seconds to allow for a controller fail-over (which takes a maximum of 180 seconds)

Motivation

1. Minimize the chance of a false positive isolation response
2. Ensure that, in the event the storage is unavailable, virtual machines are promptly shut down to minimize the impact on applications and data.

Architectural Decision

Turn off the default isolation address and configure the isolation addresses specified below, which check connectivity to both Netapp vFilers (IP storage) on the vFiler management VLAN and on the IP storage interfaces.

Utilize Datastore heartbeating, checking multiple datastores hosted across both Netapp controllers (in HA Pair) to confirm the datastores themselves are accessible.

Services VLANs
das.isolationaddress1 : vFiler1 Mgmt Interface 192.168.1.10
das.isolationaddress2 : vFiler2 Mgmt Interface 192.168.2.10

IP Storage VLANs
das.isolationaddress3 : vFiler1 vIF 192.168.10.10
das.isolationaddress4 : vFiler2 vIF 192.168.20.10

Configure Datastore Heartbeating with “Select any of the cluster’s datastores taking into account my preferences” and select the following datastores:

  • One datastore from vFiler1 (Preference)
  • One datastore from vFiler2 (Preference)
  • A second datastore from vFiler1
  • A second datastore from vFiler2

Configure Host Isolation Response to: Power off.
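For those who want to script these settings rather than set them in the vSphere Client, the following pyVmomi (VMware Python SDK) sketch applies the advanced options, heartbeat datastore policy and isolation response above. The vCenter address, credentials and the cluster name “Cluster01” are placeholders, and HA is assumed to already be enabled on the cluster.

```python
# Rough sketch only: apply the HA isolation settings described above.
# vCenter address, credentials and cluster name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster01")
view.DestroyView()

# Disable the default isolation address and add the four vFiler addresses
isolation_opts = [
    vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
    vim.option.OptionValue(key="das.isolationaddress1", value="192.168.1.10"),
    vim.option.OptionValue(key="das.isolationaddress2", value="192.168.2.10"),
    vim.option.OptionValue(key="das.isolationaddress3", value="192.168.10.10"),
    vim.option.OptionValue(key="das.isolationaddress4", value="192.168.20.10"),
]

das_config = vim.cluster.DasConfigInfo(
    option=isolation_opts,
    # "Select any of the cluster's datastores taking into account my preferences"
    hBDatastoreCandidatePolicy="allFeasibleDsWithUserPreference",
    # heartbeatDatastore=[...],  # the four preferred datastore objects would go here
    defaultVmSettings=vim.cluster.DasVmSettings(isolationResponse="powerOff"),
)

# modify=True merges these settings with the cluster's existing configuration
spec = vim.cluster.ConfigSpecEx(dasConfig=das_config)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
Disconnect(si)
```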

Justification

1. The ESXi management traffic is running on a standard vSwitch with 2 x 1Gb connections, which connect to different physical switches from the IP storage (and data) traffic (which runs over 10Gb connections). Using the ESXi management gateway (the default isolation address) to determine isolation is not suitable, as the management network can be offline without impacting the IP storage or data networks. This situation could lead to false positive isolation responses.
2. The isolation addresses chosen test both data and IP storage connectivity over the converged 10Gb network
3. In the event the four isolation addresses (the Netapp vFilers on the Services and IP storage interfaces) cannot be reached by ICMP, Datastore Heartbeating will be used to confirm whether the specified datastores (hosted on separate physical Netapp controllers) are accessible before any isolation action is taken.
4. In the event the two storage controllers do not respond to ICMP on either the Services or IP storage interfaces, and both the specified datastores are inaccessible, it is likely there has been a catastrophic failure in the environment, either in the network or in the storage controllers themselves, in which case the safest option is to shut down the VMs.
5. In the event the isolation response is triggered and the isolation does not impact all hosts within the cluster, the VMs will be restarted by HA on a surviving host.

Implications

1. In the event the host cannot reach any of the isolation addresses, and datastore heartbeating cannot access the specified datastores, virtual machines will be powered off.

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address

For more details, refer to my post “VMware HA and IP Storage”.

Native NFS Snapshots (VAAI) w/ VMware View Composer (View 5.1)

Following my post on the Netapp Edge VSA and the Rapid Clone Utility, it made sense to write a piece on the new VAAI functionality in VMware View 5.1, which allows the use of Netapp native NFS snapshots (VAAI) for VMware View Composer linked clone deployments.

This feature is really the missing piece of the puzzle, as the Rapid Clone Utility (RCU) could deploy large numbers of desktops very quickly, but it could only create manual pools, which may have been a pain point for some customers.

So let’s jump right in.

To take advantage of native NFS snapshot functionality within VAAI you need to install the NFS VAAI Plugin.

The official documentation on the plugin can be found here.
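As an aside, the plugin can also be installed from the command line with esxcli rather than through VSC. The sketch below simply pushes that esxcli command over SSH with Python/paramiko; the host name, credentials and bundle path/filename are placeholders, SSH is assumed to be enabled on the host, and the offline bundle is assumed to have already been copied to a datastore.

```python
# Rough sketch only: install the NFS VAAI plugin offline bundle via esxcli over SSH.
# Host name, credentials and the bundle path/filename are placeholders.
import paramiko

HOST = "esxi01.lab.local"
BUNDLE = "/vmfs/volumes/datastore1/NetAppNasPlugin.zip"  # hypothetical path and filename

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username="root", password="password")

# Install the VIB from the offline bundle; the host still needs a reboot afterwards
stdin, stdout, stderr = ssh.exec_command(
    "esxcli software vib install -d {}".format(BUNDLE))
print(stdout.read().decode())
print(stderr.read().decode())
ssh.close()
```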

The easiest way, however, is to download the offline bundle from now.netapp.com and use the VSC plugin to complete the installation; see below for instructions.

The below screenshots are designed to be visual aids to support the above written instructions.

The below is the VSC plugin main screen.

Click the “Tools” option on the left hand side

Click the “Install on host” button, then select the hosts you want to install the plugin on and press “Install”

Select “Yes” to confirm the installation

The installation will begin as shown below. The installation was not super fast for me, so be patient.

After around 3 minutes (in my lab anyway) it should complete, following which you should reboot your host(s).

The easiest way to confirm if the installation was successful is to check the “Hardware Accelerated” column (on the far right) for your datastores. Ensure it is now showing “Supported” as per the below example.

If for some reason it still shows “Not Supported”, reboot your host, and if that doesn’t work, reinstall the plugin.
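If you prefer to check this programmatically rather than through the client, a quick pyVmomi (VMware Python SDK) sketch along the following lines lists each mounted volume on a host together with its hardware acceleration (VAAI) status; the vCenter address, credentials and host name are placeholders.

```python
# Rough sketch only: report the VAAI / hardware acceleration status of each
# datastore mounted on a host. Names and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect

si = SmartConnect(host="vcenter.lab.local", user="administrator",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

host = content.searchIndex.FindByDnsName(None, "esxi01.lab.local", vmSearch=False)
storage = host.configManager.storageSystem

for mount in storage.fileSystemVolumeInfo.mountInfo:
    # vStorageSupport should report "vStorageSupported" once the plugin is active
    print(mount.volume.name, mount.volume.type, mount.vStorageSupport)

Disconnect(si)
```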

Now that we have the plugin installed, it’s time to get into VMware View Administrator.

Launch the web interface to your connection broker, and login.

You should see something similar to the below after logging in.

Note: My system health shows some errors due to not having signed certificates; this will not impact the functionality.

Now, this article assumes your environment is already configured with a vCenter and View Composer server, like the below. If you do not have vCenter and View Composer configured, this article does not cover these steps.

The below shows the VMware View Administrator console. To create a pool (with or without native NFS snapshots), we use the “Add” button shown below.

In the Pool definitions section, we start at the “Type” menu.

For this example, to use View Composer, we select the “Automated Pool” option and press “Next”.

The “User Assignment” screen gives us two (2) options; both can leverage the native NFS snapshots, but in this case I have selected “Floating”.

In the “vCenter server” menu, Select “View Composer linked clones” and press “Next”.

We are now in the “Settings” section of the Add Pool wizard. Here we set the ID and display name; for this example, both are set to “W7TestPool”. After you set your ID and display name, press “Next”.

In Pool settings, I have chosen to leave everything at the default for this example. In the real world, each of these settings should be carefully considered.

In the “Provisioning settings” menu, the first of the two (2) main things to do is to set the naming pattern, which should be a logical name for your environment followed by {n:fixed=3}; this results in three (3) digits after your chosen name so you can support VMs 001 through 999, as the quick illustration below shows.
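For anyone unsure what that pattern expands to, here is a plain-Python illustration (not View code) using a hypothetical pattern of W7TestPool-{n:fixed=3}.

```python
# Illustrative only: what a naming pattern like "W7TestPool-{n:fixed=3}" produces
for n in (1, 2, 10, 999):
    print("W7TestPool-{:03d}".format(n))  # W7TestPool-001, W7TestPool-002, W7TestPool-010, W7TestPool-999
```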

Then we select the maximum number of desktops and the number of spare desktops.

In this example I want to provision all desktops up-front to demonstrate the speed of deployment.

In a production environment this would not generally be the most efficient setting.

The “View Composer” disks menu allows us to configure “disposable disks”. These are not required for this example, as no users will be using the desktops I am deploying in this test lab. However, in a production environment this is an option you need to carefully consider.

The “Storage Optimization” menu allows both persistent and replica disks to be separated from OS disks. Again, this is something to carefully consider in your production environments, but it is not relevant to this example. As such, neither option is used.

Now we select the Parent VM & Snapshot. In this case, I am using a Windows 7 VM which I have prepared. There is nothing special about this image; it is just a bare Windows 7 installation patched using Windows Update, nothing more.

For this example, I am using my “MgmtCluster”, which is just a cluster containing my single physical ESXi 5.0 host.

The datastores option is important: to make full use of the native NFS snapshots, the Parent VM should reside on the same NFS datastore as the linked clones.

I have selected “NetappEdge_Vol1” as this is where my Parent VM resides. You have the option to set the “Storage Overcommitment” level as shown below; however, this is not relevant as we’re using the native NFS snapshots option later in the wizard.

The below shows all options completed; now we hit “Next”.

Here we see we have the option to “Use native NFS snapshots (VAAI)”. If this is greyed out, you may have an issue with the plugin installation, or the datastore you have selected is not on your Netapp Edge/FAS or IBM N-Series controller.

We can also use host caching (CBRC), which will generally provide good performance, so I have left it enabled.

In the guest customization section, we can set an AD container where we want the linked clones to reside. In a production environment you should use this feature, but for this demonstration it’s not relevant.

You can also use QuickPrep or Sysprep – Each has Pros & Cons, but both work with the Native NFS snapshots.

Now we’re done, so all we need to do is hit “Finish”.

Now, I have included the below screenshot of the datastores prior to the linked clones being deployed as a baseline, showing there is 31.50GB free on “NetappEdge_Vol1”, which will be used for this demonstration.

Having completed the “Add Pool” wizard, after a short delay the initial clone of the master VM will start and you will see a task similar to the below appear.

We can also see from the above that the first two clones took just 20 seconds.

Now see the next screenshot, where the tenth VM is powering on, confirming the storage (cloning) part of the process is complete. Note the completed time of 20:53:39 against the start time of 20:50:12; this means all 10 VMs were cloned and registered to vCenter in just 3 minutes 27 seconds (or 20.7 seconds per 10GB VM).

At this stage the VMs are all booting up and customizing before registering with the connection broker.

This step in the process is largely dependent on the amount of compute in your cluster and the storage performance (mostly from a read perspective). As I have only a single host, with my servers, storage and desktops running on the same host, the time it takes to complete this step will be longer than in a production environment.

In conclusion, the new native NFS snapshots (VAAI) functionality clearly demonstrates a significant step forward in improving desktop provisioning times with View Composer. It also largely removes the compute and I/O impact of cloning on your vSphere cluster and storage array.

The performance appears to be similar to the performance of the Rapid Clone Utility (RCU) without the restriction of having to use “Manual Pools”.

As such I would encourage anyone looking at VDI solutions to consider this technology, as it has a number of important benefits over traditional “dumb disk”.