Example Architectural Decision – Datastore (LUN) and Virtual Disk Provisioning

Problem Statement

In a vSphere environment, what is the most suitable disk provisioning type to use for the LUN and the virtual machines to ensure minimum storage overhead and optimal performance?

Requirements

1. Ensure optimal storage capacity utilization
2. Ensure storage performance is both consistent & maximized

Assumptions

1. vSphere 4.1 or later
2. VAAI is supported and enabled
3. Array level data replication is being used throughout the environment
4. Monitoring of the environment (including vSphere and Storage) is a manual process
5. The lead time to order new hardware (e.g. new disk shelves) is a minimum of 3 months

Constraints

1. Block based storage

Motivation

1. Increase flexibility
2. Ensure physical disk space is not unnecessarily wasted

Architectural Decision

“Thick Provision” the LUN at the Storage layer and “Thin Provision” the virtual machines at the VMware layer
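
To illustrate the VMware-layer half of this decision, the sketch below uses the pyVmomi Python SDK to add a thin-provisioned VMDK to an existing virtual machine. The vCenter address, credentials, VM name and 50GB disk size are placeholders for illustration, not part of the decision; the LUN backing the datastore remains thick provisioned at the storage layer regardless of how the VMDK is provisioned.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical vCenter connection details.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Locate the target VM (hypothetical name).
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "example-vm01")
view.Destroy()

# Find an existing SCSI controller and a free unit number (unit 7 is reserved).
controller = next(d for d in vm.config.hardware.device
                  if isinstance(d, vim.vm.device.VirtualSCSIController))
used_units = {d.unitNumber for d in vm.config.hardware.device
              if getattr(d, "controllerKey", None) == controller.key}
unit_number = next(u for u in range(16) if u != 7 and u not in used_units)

# Build a 50GB virtual disk using the "Thin Provision" format.
disk = vim.vm.device.VirtualDisk()
disk.capacityInKB = 50 * 1024 * 1024
disk.controllerKey = controller.key
disk.unitNumber = unit_number
disk.backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo(
    diskMode="persistent", thinProvisioned=True)

device_spec = vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
    fileOperation=vim.vm.device.VirtualDeviceSpec.FileOperation.create,
    device=disk)

# The new VMDK only consumes space on the thick provisioned LUN as data is written.
task = vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[device_spec]))
Disconnect(si)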

Justification

1. Simplified capacity management as only one layer (vSphere layer) needs to be monitored for capacity
2. The free space shown by vSphere is actual usable storage
3. Reduces the chance of an “Out of Space” condition
4. Increases flexibility as all unused capacity of all datastores remains available
5. Creating VMs with “Thick Provisioned – Eager Zeroed” disks would increase the provisioning time
6. Creating VMs as “Thick Provisioned” (Eager or Lazy Zeroed) does not provide any significant benefit but adds a serious capacity penalty
7. Using Thin Provisioned virtual machines minimizes storage replication traffic on creation of virtual machines
8. Using Thick Provisioned LUNs reduces the requirement for fast turnaround times when purchasing additional capacity
9. Monitoring is essential to successfully and safely use “Thin on Thin”; as monitoring is currently a manual process (Assumption 4), that option carries unnecessary risk

Alternatives

1. Thin provision the LUN and thick provision the virtual machine disks (VMDKs)
2. Thick provision the LUN and thick provision the virtual machine disks (VMDKs)
3. Thin provision the LUN and thin provision the virtual machine disks (VMDKs)

Implications

1. No storage over commitment can occur on the physical array
2. Consumed storage will be reported differently to the vSphere administrator and the storage administrator: the vSphere administrator will see the true utilization, whereas the SAN administrator will see the “Consumed” and “Provisioned” values as the same
3. It is possible for a datastore to become overcommitted; if it is not monitored, the datastore may run out of free space, which would result in an outage (see the monitoring sketch below)
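
As capacity is now managed purely at the vSphere layer, something along the lines of the pyVmomi sketch below could be scheduled to report datastore usage and overcommitment. The vCenter details and the 80% alert threshold are assumptions for illustration; any equivalent monitoring or alerting tooling would satisfy the same requirement.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical vCenter connection details.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
for ds in view.view:
    s = ds.summary
    capacity_gb = s.capacity / (1024 ** 3)
    used_gb = (s.capacity - s.freeSpace) / (1024 ** 3)
    # Provisioned = space already written plus space still claimable by thin VMDKs.
    provisioned_gb = used_gb + (s.uncommitted or 0) / (1024 ** 3)
    print(f"{s.name}: capacity {capacity_gb:.0f}GB, used {used_gb:.0f}GB, "
          f"provisioned {provisioned_gb:.0f}GB ({provisioned_gb / capacity_gb:.0%} of capacity)")
    if used_gb / capacity_gb > 0.80:  # assumed alert threshold
        print(f"  WARNING: {s.name} is over 80% utilised - plan additional capacity now")
view.Destroy()
Disconnect(si)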

Related Articles

1. Datastore (LUN) and Virtual Disk Provisioning (Thin on Thin)


Example Architectural Decision – Securing vMotion & Fault Tolerance Traffic in IaaS/Cloud Environments

Problem Statement

vMotion and Fault Tolerance logging traffic is unencrypted, and anyone with access to the same VLAN/network could potentially view and/or compromise this traffic. How can the environment be made as secure as possible to ensure security between tenants in a multi-tenant/multi-department environment?

Assumptions

1. vMotion and FT are required in the vSphere cluster/s (although FT is currently not supported for VMs hosted with vCloud Director)
2. IP storage is being used, and the vNetworking design provides 2 x 10Gb NICs for non virtual machine traffic such as VMkernel interfaces, and 2 x 10Gb NICs for virtual machine traffic (similar to the Example vNetworking Design for IP Storage)
3. VI3 or later

Motivation

1. Ensure maximum security and performance for vMotion and FT traffic
2. Prevent vMotion and/or FT traffic impacting production virtual machines

Architectural Decision

vMotion and Fault Tolerance logging traffic will each have a dedicated non-routable VLAN, hosted on a dvSwitch that is physically separate from the virtual machine distributed virtual switch.
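
One possible way to create the dedicated port groups behind this decision is sketched below using the pyVmomi Python SDK. The dvSwitch name (dvSwitch-VMkernel), port group names and VLAN IDs (100 and 101) are hypothetical examples rather than prescribed values, and the VLANs must still be configured as non-routable on the physical network.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical vCenter connection details.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Locate the dvSwitch dedicated to VMkernel traffic (hypothetical name).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.DistributedVirtualSwitch], True)
dvs = next(d for d in view.view if d.name == "dvSwitch-VMkernel")
view.Destroy()

# Dedicated non-routable VLANs for vMotion and FT logging (hypothetical IDs).
portgroups = {"dvPG-vMotion": 100, "dvPG-FT-Logging": 101}

specs = []
for pg_name, vlan_id in portgroups.items():
    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
    spec.name = pg_name
    spec.type = "earlyBinding"
    spec.numPorts = 16
    port_config = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy()
    port_config.vlan = vim.dvs.VmwareDistributedVirtualSwitch.VlanIdSpec(
        vlanId=vlan_id, inherited=False)
    spec.defaultPortConfig = port_config
    specs.append(spec)

# Create both port groups on the VMkernel dvSwitch in one task.
task = dvs.AddDVPortgroup_Task(spec=specs)
Disconnect(si)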

Justification

1.  vMotion / FT traffic does not require external (or public) access
2. A VLAN per function ensures maximum security / performance with minimal design / implementation overhead
3. Prevents vMotion and/or FT traffic from impacting production virtual machines (and vice versa), as the traffic does not share a broadcast domain with virtual machine traffic
4. Ensures vMotion/FT traffic cannot leave its respective dedicated VLAN and potentially be sniffed

Implications

1. Two (2) VLANs with private IP ranges are required to be presented over 802.1q connections to the appropriate pNICs
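
To act on this implication per host, VMkernel adapters can be created on the dedicated port groups and tagged for vMotion and FT logging. The pyVmomi sketch below shows one way this might be done; the host name, port group names and private IP addressing are hypothetical.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical vCenter connection details.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Locate an ESXi host and the dedicated VMkernel dvSwitch (hypothetical names).
host_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = next(h for h in host_view.view if h.name == "esxi01.example.com")
host_view.Destroy()
dvs_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.DistributedVirtualSwitch], True)
dvs = next(d for d in dvs_view.view if d.name == "dvSwitch-VMkernel")
dvs_view.Destroy()

def add_vmkernel_nic(portgroup_name, ip, nic_type):
    # Create a VMkernel adapter on the dvPortgroup and tag it for the given service.
    pg = next(p for p in dvs.portgroup if p.name == portgroup_name)
    spec = vim.host.VirtualNic.Specification()
    spec.ip = vim.host.IpConfig(dhcp=False, ipAddress=ip, subnetMask="255.255.255.0")
    spec.distributedVirtualPort = vim.dvs.PortConnection(switchUuid=dvs.uuid, portgroupKey=pg.key)
    # An empty port group name is passed because the vmknic connects to a dvPort instead.
    device = host.configManager.networkSystem.AddVirtualNic("", spec)
    host.configManager.virtualNicManager.SelectVnicForNicType(nic_type, device)

# Hypothetical addressing within the two private, non-routable VLANs.
add_vmkernel_nic("dvPG-vMotion", "192.168.100.11", "vmotion")
add_vmkernel_nic("dvPG-FT-Logging", "192.168.101.11", "faultToleranceLogging")
Disconnect(si)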

Alternatives

1. vMotion / FT share the ESXi management VLAN – this would increase the risk of traffic being intercepted and “sniffed”
2. vMotion / FT share a dvSwitch with Virtual Machine networks while still running within dedicated non routable VLANs over 802.1q

Example Architectural Decision – Storage I/O control for IaaS solutions

Problem Statement

In vSphere clusters servicing IaaS or Cloud workloads where customers or departments have the ability to self provision virtual machines with varying storage I/O requirements, how can the cluster be configured to ensure the most consistent virtual machine performance from a storage perspective?

Assumptions

1. vSphere 5.1 or later (to support both VMFS and NFS datastores and SIOC Automatic latency threshold computation)

Motivation

1. Ensure consistent storage performance for all virtual machines
2. Prevent a single virtual machine from denying other virtual machines reasonable access to storage

Architectural Decision

Enable Storage I/O control for all datastores and leave the shares values at the default setting for all virtual machines.

Set the Tier 1 storage congestion threshold to 10ms – e.g. SSD or 15k RPM SAS
Set the Tier 2 storage congestion threshold to 20ms – e.g. 15k or 10k RPM SAS
Set the Tier 3 storage congestion threshold to 30ms – e.g. 7.2k RPM SATA
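
A minimal pyVmomi sketch of how these thresholds might be applied is shown below. The mapping of datastores to tiers by a name prefix (Tier1/Tier2/Tier3), along with the vCenter connection details, is an assumption made purely for illustration.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical vCenter connection details.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Assumed naming convention: datastore names are prefixed with their tier, e.g. "Tier1-DS01".
tier_thresholds_ms = {"Tier1": 10, "Tier2": 20, "Tier3": 30}

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
for ds in view.view:
    threshold = next((ms for prefix, ms in tier_thresholds_ms.items()
                      if ds.name.startswith(prefix)), None)
    if threshold is None:
        continue  # datastore does not follow the assumed naming convention
    spec = vim.StorageResourceManager.IORMConfigSpec()
    spec.enabled = True                       # enable Storage I/O Control on the datastore
    spec.congestionThresholdMode = "manual"   # manual thresholds, set per tier
    spec.congestionThreshold = threshold      # congestion threshold in milliseconds
    content.storageResourceManager.ConfigureDatastoreIORM_Task(ds, spec)
    print(f"SIOC enabled on {ds.name} with a {threshold}ms congestion threshold")
view.Destroy()
Disconnect(si)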

Justification

1. In an IaaS or Cloud environment, it is important to prevent intentional or unintentional DoS-type attacks; Storage I/O Control prevents such activity by giving all virtual machines attempting concurrent access equal access to the storage.
2. Ensures no virtual machine/s monopolize the available I/O of the underlying storage, e.g. the noisy neighbor issue
3. Storage I/O Control enforces fairness across all ESXi hosts with access to the datastore, not just within a single host, ensuring equal I/O access across the environment.
4. Tier 1 storage should maintain lower latency than the lower tiers of disk; as such, a lower congestion threshold is advisable to ensure optimal performance for virtual machines hosted on Tier 1
5. Virtual machines requiring significant I/O will not be significantly impacted by Storage I/O Control (assuming the congestion threshold is reached), as other VMs requiring access will be able to complete their I/O in a timely manner (thanks to Storage I/O Control) and, once that I/O completes, will no longer impact performance at all
6. Virtual machines not accessing storage regularly will not impact the VMs that do, as Storage I/O Control only acts on VMs accessing storage concurrently
7. Leaving VMs with the default share value decreases administrative overhead and prevents human error from granting significantly higher (or lower) share values, which may negatively impact performance for one or more VMs

Implications

1. When using Storage DRS with SIOC, the Storage DRS I/O latency setting needs to be carefully considered. Setting this value below the SIOC congestion thresholds (assuming manual thresholds are set) is recommended so that Storage DRS can work towards evenly balancing the storage workload and improving overall performance, while SIOC then helps ensure consistent performance by taking action to minimize latency spikes when the congestion threshold is reached.

Alternatives

1. For vSphere 5.1 environments, use the “Automatic Latency Threshold” by selecting “Percentage of Peak Throughput” and setting the percentage value to 90%. This setting is designed to minimize the chance of a misaligned congestion threshold being manually set, which could reduce the effectiveness of SIOC (a short sketch of this alternative follows this list)
2. Do not enable Storage I/O control
3. Enable Storage I/O control and set higher-than-default share values on critical VMs
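
If Alternative 1 above were preferred, the same IORM configuration could be switched to the automatic mode, as in the short pyVmomi sketch below (connection details are again placeholders).

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical vCenter connection details.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
for ds in view.view:
    spec = vim.StorageResourceManager.IORMConfigSpec()
    spec.enabled = True
    spec.congestionThresholdMode = "automatic"  # let SIOC derive the congestion threshold
    spec.percentOfPeakThroughput = 90           # 90% of peak throughput, per Alternative 1
    content.storageResourceManager.ConfigureDatastoreIORM_Task(ds, spec)
view.Destroy()
Disconnect(si)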