Ensuring Data Integrity with Nutanix – Part 2 – Forced Unit Access (FUA) & Write Through

In the Integrity of Write I/O for VMs on NFS Datastores series, Part 2 covered a vendor-agnostic view of Forced Unit Access (FUA) and Write Through.

In this series, the goal is to explain how Nutanix guarantees data integrity, and how this is Nutanix's number one priority. In addition, I want to show how Nutanix supports Business Critical Applications such as MS SQL and MS Exchange, which have strict storage requirements including Write Ordering, Forced Unit Access (FUA), SCSI abort/reset commands and protection against Torn I/O.

Note: From Windows Server 2012 onwards, FUA is no longer used in favour of issuing a “Flush” of the drive’s write cache. However, this change makes no difference to Nutanix environments, because regardless of whether FUA or a Flush is used, write I/O is not acknowledged until it is written to persistent media on two or more nodes, which will be explained further later in this post.
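
To make the distinction concrete, below is a minimal Python sketch, assuming a POSIX system such as Linux (illustrative only; Windows issues FUA/Flush through its own storage stack). It shows the two behaviours an application can request: per-write write-through, which is the effect FUA has, and buffered writes followed by an explicit cache flush.

```python
import os

# Minimal sketch, assuming a POSIX system (e.g. Linux). Illustrative only:
# Windows issues FUA/Flush through its own storage stack.

# 1. Per-write write-through (the effect FUA has): with O_DSYNC each
#    write() returns only once the data is on stable storage.
fd = os.open("/tmp/tx.log", os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, b"transaction record\n")
os.close(fd)

# 2. Explicit flush (the Windows 2012 behaviour described above): writes
#    land in the OS cache, then a flush pushes them to stable media.
fd = os.open("/tmp/tx.log", os.O_WRONLY | os.O_APPEND)
os.write(fd, b"another record\n")
os.fsync(fd)   # blocks until the data is pushed out to the device
os.close(fd)
```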

Currently MS Exchange is not supported to run in a VMDK on an NFS datastore, although interestingly Active Directory and MS SQL servers, which have the exact same storage requirements (discussed earlier), are supported. This post will show why Microsoft should allow storage vendors to certify Exchange in VMDK on NFS datastore deployments to prove compliance with the storage requirements stated earlier.

Note: Nutanix provides support for Exchange 2010/2013 deployments in VMDKs on NFS datastores. Customers can find this support statement on http://portal.nutanix.com/ under article number 000001303.

First, I would like to state that FUA is fully supported by VMware ESXi.

In the Microsoft article, Deploying Transactional NTFS, it states:

“The caching control mechanism used by TxF is a flag known as the Force Unit Access (FUA) function. This flag specifies that the drive should write the data to stable media storage before signaling complete.”

Nutanix meets this requirement, as all writes are written to persistent media (SSD) on at least two independent nodes, and no write caching is performed at any layer, including the Nutanix Controller VM (CVM), the physical storage controller card or the physical drives themselves.


The article also states:

“Some Host Bus Adapters (HBAs) and storage controllers (for example, RAID systems) have built-in battery-backed caches. Because these devices preserve cached data if a power fault occurs, any disks connected to them are not required to honor the FUA flag. Further, a disk whose power supply is protected by an uninterruptable power supply (UPS) does not need to honor the FUA flag. This is because the UPS will maintain power long enough for the disk to flush its cache to the media.”

As discussed with the previous requirement, Nutanix meets this requirement as the write acknowledgement is not given until writes are successfully committed to persistent storage on at least two nodes. As a result, even without a UPS, data integrity can be guaranteed in a Nutanix environment.


Another key point in the article is:

“Disabling a drive’s write cache eliminates the requirement for the drive to honor the FUA flag.”

All physical drives (SSD and SATA) in Nutanix nodes have their write cache disabled, therefore removing the requirement to honor FUA.

The article concludes with the following:

“Note: For TxF to be capable of consistently protecting your data’s integrity through power faults, the system must satisfy at least one of the following criteria:

1. Use server-class disks (SCSI, Fiber Channel)

2. Make sure the disks are connected to a battery-backed caching HBA.

3. Use a storage controller (for example, RAID system) as the storage device.

4. Ensure power to the disk is protected by a UPS.

5. Ensure that the disk’s write caching feature is disabled.”

We have already discussed that because Nutanix does not use a non-persistent write cache, there is no requirement for the OS to issue the FUA flag (or the Flush command in Windows 2012) to ensure data is written to persistent media. But for fun, let's see how many of the above criteria Nutanix complies with.

1. YES – Nutanix uses enterprise-grade Intel S3700 SSDs for all write I/O
2. N/A – There is no need for battery-backed caching HBAs as Nutanix does not give write acknowledgement until data is written to persistent media on two or more nodes
3. YES – Nutanix Distributed File System (NDFS) with Resiliency Factor (RF) 2 or 3
4. RECOMMENDED – A UPS ensures system uptime but is not required to ensure data integrity, as writes are not acknowledged until written to persistent media on two or more nodes
5. YES – All write caching features are disabled on all SSDs/HDDs

So to meet Microsoft's FUA requirements, only one of the above is required. Nutanix meets three of the five outright, with a fourth being recommended (but not required) and the final criterion not being applicable.

Write Cache and Write Acknowledgements

Nutanix does not use a non-persistent write cache, period.

When an I/O is issued in a Nutanix environment, if it is random, it is sent to the “OpLog”, a persistent write buffer stored on SSD.

If the I/O is sequential, it is sent straight to the Extent Store, which is persistent data storage also located on SSD.

Both Random and Sequential I/O flows are shown in the below diagram from The Nutanix Bible by @StevenPoitras.

[Diagram: NDFS I/O path showing random writes to the OpLog and sequential writes to the Extent Store, from The Nutanix Bible]

All writes are also protected by a Resiliency Factor (RF) of 2 or 3, meaning two or three copies of the data are synchronously replicated to other Nutanix nodes within the cluster prior to the write being acknowledged.

To be clear, write acknowledgements are NOT sent until the data is written to the OpLog or Extent Store of two or three nodes (depending on the configured RF). This means the intent of Forced Unit Access (FUA) is achieved, as every write is on persistent media before the write acknowledgement is sent, regardless of whether FUA (or a Flush) is issued by the OS.

Importantly, this write acknowledgement process is the same regardless of the storage protocol (iSCSI, NFS, SMB 3.0) used to present storage to the hypervisor (ESXi, Hyper-V or KVM).
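
The following is a simplified Python sketch of this acknowledgement logic. It is illustrative only, not Nutanix's actual implementation; the node objects and the write_to_persistent_media() method are hypothetical placeholders.

```python
import concurrent.futures

# Illustrative sketch only - NOT Nutanix's actual implementation. The node
# objects and write_to_persistent_media() method are hypothetical.

def persist_on_node(node, data):
    # Write to the node's OpLog/Extent Store on SSD (with the disk write
    # cache disabled), returning True only once the data is durable.
    return node.write_to_persistent_media(data)

def replicated_write(nodes, data, rf=2):
    """Acknowledge a write only after it is durable on `rf` nodes."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda n: persist_on_node(n, data), nodes[:rf]))
    if not all(results):
        raise IOError("replica write failed - no acknowledgement sent")
    return "ACK"  # only now does the hypervisor (and guest) see completion
```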

Physical Drive Configuration

As Nutanix does not use a non-persistent write cache, and does not acknowledge writes until they are written to persistent media on two or three nodes, that's the end of the story, right?

Not really, as physical drives also have write caches, and in the event of a power failure it is possible (albeit unlikely) that data in the cache may not be written to disk even after a write acknowledgement has been sent.

This is why all physical SSD/SATA drives in a Nutanix environment have the disk's write caching feature disabled.

This ensures there is no dependency on an uninterruptible power supply (UPS) for data to be successfully written to disk in the event of a power failure.

This means Nutanix is compliant with the “Ensure that the disk’s write caching feature is disabled” requirement specified by Microsoft.
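
For illustration, on a generic Linux host a disk's volatile write cache can be inspected and disabled with hdparm, as sketched below. The device path is an assumed example, and Nutanix nodes ship with this configuration already applied.

```python
import subprocess

# Illustrative only: on a generic Linux host, a disk's volatile write cache
# can be inspected and disabled with hdparm. The device path is an assumed
# example; Nutanix nodes ship with write caching already disabled.
subprocess.run(["hdparm", "-W", "/dev/sda"], check=True)       # show state
subprocess.run(["hdparm", "-W", "0", "/dev/sda"], check=True)  # disable
```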

Uninterruptible Power Supplies (UPS)

As non-persistent write caching is not used at the Nutanix Controller VM (CVM), the physical storage controller or the physical SSDs/HDDs, a UPS is not required in a Nutanix environment to ensure data integrity. However, it is still recommended to use a suitable UPS to ensure uptime of the environment. Assuming a power outage is not catastrophic (e.g. it affects only a single node) and the cluster remains online, write acknowledgements are still not given until data is written in accordance with the configured RF policy, as Nutanix nodes are effectively stateless.

The Microsoft article quoted earlier states:

“Further, a disk whose power supply is protected by an uninterruptable power supply (UPS) does not need to honor the FUA flag. This is because the UPS will maintain power long enough for the disk to flush its cache to the media.”

Even where a storage solution or disk is protected by a UPS, sufficient time is required to allow all data in the cache to be written to persistent media. This is a potential risk to data consistency, as a UPS is just another link in the chain which can fail. This is why Nutanix does not depend on a UPS for data integrity.

Another Microsoft article, Key factors to consider when evaluating third-party file cache systems with SQL Server, gives two examples of how data corruption can occur:

“Example 1: Data loss and physical or logical corruption”

“Example 2: Suspect database”

So how does Nutanix protect against these issues?

The article states:

“How to configure a product providing file cache from something like non-battery backed cache is specific to the vendor implementation. A few rules, however, can be applied:

1. All writes must be completed in or on stable media before the cache indicates to the operating system that the I/O is finished.

2. Data can be cached as long as a read request serviced from the cache returns the same image as located in or on stable media.”

Regarding the first point: all write I/O is written to persistent media (as is the intention of FUA), as described earlier in this article.

For the second point, with the Nutanix Distributed File System (NDFS) a read I/O can be served from one of the following places:

  1. The “Extent Cache”, located in RAM.
  2. The “Content Cache”, located on SSD as per the earlier diagram.
  3. The “OpLog”, the persistent write buffer located on SSD as per the earlier diagram.
  4. The “Extent Store”, located on either SSD or SATA depending on whether the data is “Hot” or “Cold”.
  5. A remote node’s Extent Cache, Content Cache, OpLog or Extent Store.

To ensure the Extent Cache (in RAM) remains consistent with the Content Cache and Extent Store on persistent media, when a write I/O modifies data which is held in the Extent Cache, the corresponding data is discarded from the Extent Cache. It is only promoted back into the Extent Cache if the data remains “Hot” (i.e. frequently accessed).
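
A simplified Python sketch of this invalidate-on-write behaviour is shown below. This is purely illustrative of the technique, not NDFS source code; the class, method names and promotion threshold are hypothetical.

```python
# Illustrative sketch only (not NDFS source code): an in-RAM read cache
# stays consistent with persistent media by discarding entries on write.

class ExtentCache:
    def __init__(self, promote_threshold=3):
        self.cache = {}            # extent_id -> data held in RAM
        self.read_counts = {}      # extent_id -> recent read count
        self.promote_threshold = promote_threshold

    def on_write(self, extent_id, data, persistent_store):
        persistent_store.write(extent_id, data)   # durable copy first
        self.cache.pop(extent_id, None)           # discard stale RAM copy
        self.read_counts[extent_id] = 0

    def on_read(self, extent_id, persistent_store):
        if extent_id in self.cache:               # served from RAM
            return self.cache[extent_id]
        data = persistent_store.read(extent_id)   # served from SSD/SATA
        self.read_counts[extent_id] = self.read_counts.get(extent_id, 0) + 1
        if self.read_counts[extent_id] >= self.promote_threshold:
            self.cache[extent_id] = data          # re-promote hot data only
        return data
```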

In Summary:

The Nutanix write path guarantees (even without the use of a UPS) that writes are committed to persistent media, with at least one redundant copy on another node in the cluster for resiliency, before acknowledging the I/O back to the hypervisor and on to the guest. This is critical to ensuring data consistency and resiliency.

This is in full compliance with the storage requirements of applications such as SQL, Exchange and Active Directory.

——————————————————–

Integrity of Write I/O for VMs on NFS Datastores Series

Part 1 – Emulation of the SCSI Protocol
Part 2 – Forced Unit Access (FUA) & Write Through
Part 3 – Write Ordering
Part 4 – Torn Writes
Part 5 – Data Corruption

Nutanix Specific Articles

Part 6 – Emulation of the SCSI Protocol (Coming soon)
Part 7 – Forced Unit Access (FUA) & Write Through
Part 8 – Write Ordering (Coming soon)
Part 9 – Torn I/O Protection (Coming soon)
Part 10 – Data Corruption (Coming soon)

Related Articles

1. What does Exchange running in a VMDK on NFS datastore look like to the Guest OS?
2. Support for Exchange Databases running within VMDKs on NFS datastores (TechNet)
3. Microsoft Exchange Improvements Suggestions Forum – Exchange on NFS/SMB
4. Virtualizing Exchange on vSphere with NFS backed storage

How to successfully Virtualize MS Exchange – Part 9 – Raw Device Mappings (RDMs)

A Raw Device Mapping or “RDM” allows a VM to access a volume (or LUN) on the physical storage via either Fibre Channel or iSCSI.

When discussing Raw Device Mappings, it is important to highlight that there are two RDM modes: Virtual Compatibility Mode and Physical Compatibility Mode.

See the following article for a detailed breakdown: Difference between Physical compatibility RDMs and Virtual compatibility RDMs (2009226).

So how does an RDM compare to a VMDK on a Datastore?

VMware released a white paper called Performance Characterization of VMFS and RDM Using a SAN in 2008, which debunked the myth that RDMs gave significantly higher performance than VMDKs on datastores.

So RDMs have NO performance advantage over VMDKs on a Datastore.

With that in mind, what advantages (if any) do RDMs have today?

VMware released their Microsoft Exchange 2010 on VMware Best Practices Guide, which has the following table on page 14 showing the trade-offs between VMFS and RDMs.

[Table: VMFS vs RDM trade-offs, from the Microsoft Exchange 2010 on VMware Best Practices Guide, page 14]

I have highlighted one advantage that RDMs still have over VMDK on datastore style deployments: the ability to migrate from a physical Exchange server using centralized SAN storage to a VM without data migration.

However, I find the most common way to migrate from a physical deployment to a virtual deployment is by performing mailbox migrations to virtualized Exchange servers in an ESXi environment. This avoids the complexities of RDMs and ensures no capacity on the shared storage is wasted (i.e. siloed).

The table also lists one other advantage for RDMs: support for up to 64TB drives, whereas virtual disks were limited to 2TB on VMFS. This limitation has since been lifted, with vSphere 5.5 supporting 62TB virtual disks.

Recommendation: Do not use RDMs for MS Exchange deployments.

As with Local Storage discussed in Part 7, RDM deployments have more downsides (mainly around inefficiency and complexity) than upsides, and I would recommend considering other storage options for virtualized Exchange deployments.

Other options along with my recommended options will be discussed in the next 2 parts of this series and in upcoming posts on Storage performance and resiliency.

Back to the Index of How to successfully Virtualize MS Exchange.

How to successfully Virtualize MS Exchange – Part 6 – vMotion

Having a virtualized Exchange server opens up the ability to use vMotion and migrate the VM between ESXi hosts without downtime. This is a handy feature enabling hardware maintenance, upgrades or replacement with no downtime and, importantly, no loss of resiliency to the application.

In this article, I am talking only about vMotion, not Storage vMotion.

Let's first discuss vMotion's requirements and configuration maximums.

vMotion requirements:

1. A VMkernel port enabled for vMotion
2. A minimum of 1 x 1Gb NIC
3. Shared storage between source and destination ESXi hosts (recommended).

vMotion Configuration Maximums:

Concurrent vMotion operations per host (1Gb/s network):  4
Concurrent vMotion operations per host (10Gb/s network):  8
Concurrent vMotion operations per datastore: 128

As discussed in Part 4, I recommend using DRS "VM to Host" should rules to ensure DRS does not vMotion Exchange VMs unnecessarily while keeping the cluster load balanced.

However, it is still important to design your environment to ensure Exchange VMs can vMotion as fast as possible and with the lowest impact during the syncing of the memory and during the final cutover.

So that brings us to our first main topic, Multi-NIC vMotion.

Multi-NIC vMotion:

Multi-NIC vMotion is a feature introduced in vSphere 5.0 which allows vMotion traffic to be sent concurrently down multiple physical NICs to increase available bandwidth and speed up vMotion activity. This effectively lowers the impact of vMotion and enables larger VMs with very high memory change rates to be vMotioned.

For those who are not familiar with the feature, it is described in depth in VMware KB: Multiple-NIC vMotion in vSphere 5 (2007467), as is the process to set it up on Virtual Standard Switches (VSS) and Virtual Distributed Switches (VDS).

From an Exchange perspective, the larger the MBX/MSR VM's vRAM, and more importantly the more "active" the memory, the longer the vMotion can take. If vMotion detects that the memory change rate is higher than the available bandwidth, the hypervisor will insert micro "stuns" to the VM's vCPUs until the change rate is low enough to complete the vMotion. This generally has minimal impact on VMs, including Exchange, but the more it can be avoided, the better.

So using Multi-NIC vMotion helps, as more bandwidth can be utilized, which means vMotion activity either completes faster or can support more active memory with low impact.
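
A back-of-the-envelope Python sketch of why this matters is below. This is a simplified model with assumed figures (link efficiency, memory change rate), not benchmarks of real vMotion behaviour.

```python
# Simplified model only - real vMotion behaviour is more sophisticated.
# Assumed figures: ~90% usable link efficiency, dirty rate in GB/s.

def vmotion_estimate_seconds(vram_gb, dirty_gb_per_s, nics, nic_gbps=10.0):
    usable_gb_per_s = nics * nic_gbps / 8 * 0.9   # Gb/s -> GB/s
    if dirty_gb_per_s >= usable_gb_per_s:
        return None   # memory never converges: hypervisor must "stun" vCPUs
    return vram_gb / (usable_gb_per_s - dirty_gb_per_s)

# 96GB Exchange VM dirtying 0.5GB/s of memory:
print(vmotion_estimate_seconds(96, 0.5, nics=1))  # single 10Gb NIC: ~154s
print(vmotion_estimate_seconds(96, 0.5, nics=2))  # Multi-NIC 2 x 10Gb: ~55s
```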

vMotion “Slot size”:

A vMotion "slot size" can be thought of as the compute and RAM capacity required to perform a vMotion of a VM between two hosts. So for a VM with 96GB of vRAM and a matching memory reservation, the destination host requires 96GB of physical RAM to be available to even qualify to begin a vMotion.

The larger the VM, the more of a factor this can become in the design of a vSphere cluster.

For example, the diagram below shows a four-host ESXi HA cluster with several large VMs, including several which are assigned 96GB of vRAM, as is common with Exchange MBX/MSR VMs.

In this scenario the Exchange VMs are represented by VMs #13, 15 and 16, each with 96GB of RAM.

[Diagram: Four-host cluster where no single host has 96GB of RAM free to accept a vMotion of an Exchange VM]

The issue here is that there is insufficient free memory on any host to accommodate a vMotion of any of the Exchange VMs. This leads to complexity during maintenance periods as well as during a HA event.

In fact in the above example, if an ESXi host crashed, HA would not be able to restart any of the Exchange VMs.
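
A trivial Python sketch of this capacity check follows, using hypothetical free-memory figures in the spirit of the diagram above.

```python
# Hypothetical free-RAM figures in the spirit of the diagram above:
# no host has 96GB free, so no Exchange VM can be vMotioned or HA-restarted.
hosts_free_ram_gb = {"esxi01": 64, "esxi02": 48, "esxi03": 80, "esxi04": 32}
exchange_vm_ram_gb = 96

candidates = [h for h, free in hosts_free_ram_gb.items()
              if free >= exchange_vm_ram_gb]
print(candidates or "No host can accept the VM - add capacity (N+1 or better)")
```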

This goes back to the point I made in Part 5 about always ensuring an N+1 (minimum) configuration for the cluster, as this should in most cases avoid this issue.

This is in addition to the recommendation in Part 4 to use VM to Host DRS "should" rules to ensure only one Exchange VM runs per host.

Enhanced vMotion Compatibility:

Enhanced vMotion Compatibility, or EVC, is used to ensure vMotion compatibility for all hosts within a cluster. EVC ensures that all hosts in a cluster present the same CPU feature set to virtual machines, even if the actual CPUs on the hosts differ. The end result is that configuring EVC prevents vMotion from failing because of incompatible CPUs.

The knowledge base article Enhanced vMotion Compatibility (EVC) processor support (1003212) from VMware explains the EVC modes and compatible CPU models. Note: EVC does not support mixing Intel and AMD CPUs.

Contrary to popular belief, EVC does not "slow down" the CPU; it only masks processor features that affect vMotion compatibility. The full speed of the processor is still utilized. The only potential performance degradation is where an application is specifically written to take advantage of masked CPU features, in which case that workload may suffer some performance loss. This is not the case with MS Exchange, and as a result I recommend EVC always be enabled to ensure the cluster is future-proofed and Exchange VMs can be migrated to newer hardware seamlessly via vMotion.

For more details on why you should enable EVC, review the Example Architectural Decision – Enhanced vMotion Compatibility.

Jumbo Frames:

Using jumbo frames helps improve vMotion throughput by reducing the number of packets, and therefore interrupts, required to migrate the same Exchange VM between two hosts.
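
Some rough Python arithmetic illustrates the scale of the reduction (payload sizes approximated by the MTU; headers and protocol overhead ignored).

```python
# Migrating a 96GB-vRAM VM: approximate packet counts at each MTU
# (payload approximated by MTU; headers and TCP overhead ignored).
vram_bytes = 96 * 1024**3
std_mtu, jumbo_mtu = 1500, 9000

std_packets = vram_bytes // std_mtu
jumbo_packets = vram_bytes // jumbo_mtu
print(f"standard: ~{std_packets:,} packets, jumbo: ~{jumbo_packets:,} packets")
print(f"~{std_packets / jumbo_packets:.0f}x fewer packets (and interrupts)")
```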

Michael Webster @vcdxnz001 (VCDX #66) wrote the following great article showing the benefit of jumbo frames for vMotion can be up to 19% in Multi-NIC vMotion environments: Jumbo Frames on vSphere 5

So we know there is a significant performance benefit, but what about the downsides of Jumbo Frames?

The following two example architectural decisions cover the pros and cons of jumbo frames, along with justifications for using and not using jumbo frames for IP storage. The same concepts hold true for vMotion, so I recommend you review both decisions and choose the one which best suits your requirements and constraints.

Note: Neither decision is "right" or "wrong", but if your environment is configured correctly for jumbo frames, you will get better vMotion performance with them.

  1. Jumbo Frames for IP Storage (Do not use Jumbo Frames)
  2. Jumbo Frames for IP Storage (Use Jumbo Frames)

vMotion Security:

vMotion traffic is unencrypted; as a result, anyone with access to the network can sniff the traffic. To avoid this, vMotion traffic should be placed on a dedicated non-routable VLAN.

For more information see: Example Architectural Decision : Securing vMotion & Fault Tolerant Traffic in IaaS/Cloud Environments.

Note: This post is relevant to all environments, not just IaaS/Cloud/Multi-tenant.

Performing a vMotion or entering Maintenance Mode:

As per Part 4, I recommended using VM to Host DRS “should” rules to ensure only one Exchange VM runs per host. This also ensures only one Exchange VM is potentially impacted by vMotion when a host enters maintenance mode.

However, simply entering maintenance mode can kick off up to eight concurrent vMotion operations when using 10Gb networking for vMotion. In this situation, the length of the vMotion of the Exchange VM will increase, potentially impacting performance for a longer period.

As such, I recommend manually vMotioning the Exchange VM onto another host not running any other Exchange VMs (and ideally no other large vCPU/vRAM VMs) and waiting for this to complete before putting the host into maintenance mode.

The benefit of this will depend on the size of your Exchange VMs and the performance of your environment but this is an easy way to minimize the chance of performance issues.

DAG Failovers during vMotion?

DAG failovers can occur when even a momentary network drop during vMotion, or the quiescing of the VM during the final stage of the vMotion, exceeds the default Windows cluster heartbeat thresholds.

With vMotion set up correctly, and ideally using Multi-NIC vMotion, this should not occur. However, this issue can be further mitigated by increasing the cluster heartbeat time-outs to help prevent unnecessary DAG failovers.

To increase the cluster heartbeat timeout see: Tuning Failover Cluster Network Thresholds
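
For reference, these thresholds are Windows failover cluster properties set via PowerShell. The sketch below shells out to PowerShell; the cluster property names are real, but the values shown are placeholders, so use the settings recommended in the Microsoft article above.

```python
import subprocess

# Illustrative only: the cluster property names below are real Windows
# failover cluster settings, but the values are placeholders - use the
# values recommended in Microsoft's article above.
ps_script = (
    "(Get-Cluster).SameSubnetDelay = 1000; "     # ms between heartbeats
    "(Get-Cluster).SameSubnetThreshold = 20; "   # missed heartbeats tolerated
    "(Get-Cluster).CrossSubnetDelay = 1000; "
    "(Get-Cluster).CrossSubnetThreshold = 20"
)
subprocess.run(["powershell.exe", "-Command", ps_script], check=True)
```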

Recommendations for vMotion:

1. Ensure vMotion is active on 10Gb (or higher) adapters
2. Enable Multi-NIC vMotion across 2 x 10Gb adapters in environments with Exchange VMs larger than 64GB RAM
3. Enable Enhanced vMotion Compatibility (EVC) to the highest supported level in your cluster
4. Use jumbo frames for vMotion traffic
5. Ensure sufficient cluster capacity to migrate Exchange VMs
6. Use DRS rules to separate Exchange VMs to ensure vMotion is not prevented (as per Part 4)
7. When evacuating ESXi hosts running Exchange VMs, vMotion the Exchange VM first, and once it has succeeded, put the host into maintenance mode
8. Use Network I/O Control (NIOC) to ensure a minimum level of bandwidth for vMotion (further details in an upcoming post)
9. Do not route vMotion traffic
10. Put vMotion traffic on a dedicated non-routable VLAN (i.e. no gateway)
11. Increase the cluster heartbeat time-outs for Windows failover clustering to the maximums outlined in Tuning Failover Cluster Network Thresholds

Back to the Index of How to successfully Virtualize MS Exchange.