Example Architectural Decision – ESXi Host Hardware Sizing (Example 1)

Problem Statement

What are the most suitable hardware specifications for this environment's ESXi hosts?

Requirements

1. Support Virtual Machines of up to 16 vCPUs and 256GB RAM
2. Achieve up to 400% CPU overcommitment
3. Achieve up to 150% RAM overcommitment
4. Ensure cluster performance is both consistent & maximized
5. Support IP based storage (NFS & iSCSI)
6. The average VM size is 1vCPU / 4GB RAM
7. The cluster must support approx 1000 average size Virtual machines on day 1
8. The solution should be scalable beyond 1000 VMs (Future-Proofing)
9. N+2 redundancy

Assumptions

1. vSphere 5.0 or later
2. vSphere Enterprise Plus licensing (to support Network I/O Control)
3. VMs range from Business Critical Application (BCAs) to non critical servers
4. Software licensing for applications being hosted in the environment is based on per vCPU OR per host where DRS “Must” rules can be used to isolate VMs to licensed ESXi hosts

Constraints

1. None

Motivation

1. Create a Scalable solution
2. Ensure high performance
3. Minimize HA overhead
4. Maximize flexibility

Architectural Decision

Use Two Socket Servers w/ >= 8 cores per socket with HT support (16 physical cores / 32 logical cores), 256GB RAM, 2 x 10GB NICs

Justification

1. Two socket servers with 8 core (or greater) CPUs and Hyper-Threading will provide flexibility for CPU scheduling of large numbers of VMs with diverse vCPU counts, minimizing CPU Ready (contention)

2. Using Two Socket servers of the proposed specification will support the required 1000 average sized VMs on 18 hosts, with 11% of cluster resources reserved for HA to meet the required N+2 redundancy.

3. A cluster size of 18 hosts will deliver excellent cluster (DRS) efficiency / flexibility with minimal overhead for HA (Only 11%) thus ensuring cluster performance is both consistent & maximized.

4. The cluster can be expanded with up to 14 more hosts (to the 32 host cluster limit) in the event the average VM size is greater than anticipated or the customer experiences growth

5. Having 2 x 10GB connections should comfortably support the IP Storage / vMotion / FT and network data with minimal possibility of contention. In the event of contention Network I/O Control will be configured to minimize any impact (see Example VMware vNetworking Design w/ 2 x 10GB NICs)

6. RAM is one of the most common bottlenecks in a virtual environment. With 16 physical cores and 256GB RAM, each host has 16GB of RAM per physical core. For the average sized VM (1vCPU / 4GB RAM), this meets the CPU overcommitment target (up to 400%) with no RAM overcommitment, minimizing the chance of RAM becoming the bottleneck

7. In the event of a host failure, the number of Virtual machines impacted will be up to 64 (based on the assumed average size VM) which is minimal when compared to a Four Socket ESXi host which would see 128 VMs impacted by a single host outage

8. If Four Socket ESXi hosts were used, the cluster size would be approx 10 hosts and 20% of cluster resources would have to be reserved for HA to meet the N+2 redundancy requirement. This cluster size is less efficient from a DRS perspective, and the HA overhead would equate to higher CapEx and, as a result, a lower ROI

9. The solution supports Virtual machines of up to 16 vCPUs and 256GB RAM although this size VM would be discouraged in favour of a scale out approach (where possible)

10. The cluster aligns with a virtualization friendly “Scale out” methodology

11. Using smaller hosts (either single socket, or fewer cores per socket) would not meet the requirement to support Virtual machines of up to 16 vCPUs and 256GB RAM, would likely require multiple clusters, and would require additional 10GB and 1GB cabling as compared to the Two Socket configuration

12. The two socket configuration allows the cluster to be scaled (expanded) at a very granular level (if required), which reduces CapEx and minimizes the wasted/unused cluster capacity that would result from adding larger hosts

13. Enabling features such as Distributed Power Management (DPM) is more attractive and lower risk for larger clusters, and may result in lower environmental costs (ie: Power / Cooling)
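
The host counts and HA overhead quoted in the justification can be reproduced with simple arithmetic. The following is a minimal Python sketch, not part of the original design. The HostSpec class and the function names are hypothetical. It assumes every VM is the average size (1vCPU / 4GB RAM), applies the 400% CPU overcommitment target with no RAM overcommitment, and treats HA overhead as the two spare hosts divided by the total host count. The single socket figures are derived from the same assumptions rather than stated in the justification.

```python
import math
from dataclasses import dataclass

# Figures taken from the requirements and justification above.
TOTAL_VMS = 1000          # average size VMs on day 1
VM_VCPUS = 1              # average VM: 1 vCPU
VM_RAM_GB = 4             # average VM: 4GB RAM
CPU_OVERCOMMIT = 4        # up to 400% CPU overcommitment
HA_SPARE_HOSTS = 2        # N+2 redundancy
CLUSTER_HOST_LIMIT = 32   # cluster limit quoted in justification 4


@dataclass
class HostSpec:
    """Hypothetical container for a candidate host configuration."""
    sockets: int
    cores_per_socket: int
    ram_gb: int

    @property
    def physical_cores(self) -> int:
        return self.sockets * self.cores_per_socket


def vms_per_host(host: HostSpec) -> int:
    """Average size VMs per host, limited by whichever resource runs out first."""
    cpu_bound = (host.physical_cores * CPU_OVERCOMMIT) // VM_VCPUS
    ram_bound = host.ram_gb // VM_RAM_GB  # assumption: no RAM overcommitment
    return min(cpu_bound, ram_bound)


def size_cluster(host: HostSpec) -> dict:
    """Hosts needed for TOTAL_VMS plus N+2 spare capacity."""
    per_host = vms_per_host(host)
    total_hosts = math.ceil(TOTAL_VMS / per_host) + HA_SPARE_HOSTS
    return {
        "vms_per_host": per_host,
        "total_hosts": total_hosts,
        "ha_overhead_pct": round(100 * HA_SPARE_HOSTS / total_hosts),
        "fits_one_cluster": total_hosts <= CLUSTER_HOST_LIMIT,
    }


two_socket = size_cluster(HostSpec(sockets=2, cores_per_socket=8, ram_gb=256))
four_socket = size_cluster(HostSpec(sockets=4, cores_per_socket=8, ram_gb=512))
single_socket = size_cluster(HostSpec(sockets=1, cores_per_socket=8, ram_gb=128))

for label, result in [
    ("Two socket", two_socket),
    ("Four socket", four_socket),
    ("Single socket", single_socket),
]:
    print(label, result)
```

Under these assumptions the two socket design lands on 18 hosts with roughly 11% HA overhead and 64 VMs per host, the four socket design on 10 hosts with 20% overhead and 128 VMs per host, and the single socket design needs 34 hosts, which exceeds the 32 host cluster limit.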

Alternatives

1. Use Four Socket Servers w/ >= 8 cores per socket, 512GB RAM, 4 x 10GB NICs
2. Use Single Socket Servers w/ >= 8 cores, 128GB RAM, 2 x 10GB NICs
3. Use Two Socket Servers w/ >= 8 cores, 512GB RAM, 2 x 10GB NICs
4. Use Two Socket Servers w/ >= 8 cores, 384GB RAM, 2 x 10GB NICs
5. Have two clusters of 9 hosts with the recommended hardware specifications

Implications

1. Additional IP addresses for ESXi Management, vMotion, FT & Out of band management will be required as compared to a solution using larger hosts

2. Additional out of band management cabling will be required as compared to a solution using larger hosts

Related Articles

1. Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage (4 x 10 GB NICs)

2. Example VMware vNetworking Design w/ 2 x 10GB NICs

3. Network I/O Control Shares/Limits for ESXi Host using IP Storage

4. VMware Clusters – Scale up for Scale out?

5. Jumbo Frames for IP Storage (Do not use Jumbo Frames)

6. Jumbo Frames for IP Storage (Use Jumbo Frames)


Example Architectural Decision – Datastore (LUN) and Virtual Disk Provisioning (Thin on Thin)

Problem Statement

In a vSphere environment, what is the most suitable disk provisioning type to use for the LUN and the virtual machines to ensure minimum storage overhead and optimal performance?

Requirements

1. Ensure optimal storage capacity utilization
2. Ensure storage performance is both consistent & maximized

Assumptions

1. vSphere 5.0 or later
2. VAAI is supported and enabled
3. The time frame to order new hardware (eg: New Disk Shelves) is <= 4 weeks
4. The storage solution has tools for fast/easy capacity management

Constraints

1. Block Based Storage

Motivation

1. Increase flexibility
2. Ensure physical disk space is not unnecessarily wasted

Architectural Decision

“Thin Provision” the LUN at the Storage layer and “Thin Provision” the virtual machines at the VMware layer

(Optional) Do not present more LUNs (capacity) than you have underlying physical storage (Only over-commitment happens at the vSphere layer)

Justification

1. Capacity can be easily managed using storage vendor tools such as NetApp VSC / EMC VSI / Nutanix Command Center
2. Thin Provisioning minimizes the impact of situations where customers demand a lot of disk space up front when they only end up using a small portion of the available disk space
3. Increases flexibility as all unused capacity of all datastores and the underlying physical storage remains available
4. Creating VMs with “Thick Provisioned – Eager Zeroed” disks would unnecessarily increase the provisioning time for new VMs
5. Creating VMs as “Thick Provisioned” (Eager or Lazy Zeroed) does not provide any significant benefit (ie: Performance) but adds a serious capacity penalty
6. Using Thin Provisioned LUNs increases the flexibility at the storage layer
7. VAAI automatically raises an alarm in vSphere if the usage of a Thin Provisioned datastore reaches >= 75% of its capacity
8. SCSI reservations causing performance issues (increased latency) when thin provisioned virtual machines (VMDKs) grow are no longer a concern, as the VAAI Atomic Test & Set (ATS) primitive alleviates the issue.
9. Thin provisioned VMs reduce the overhead for Storage vMotion, Cloning and Snapshot activities. Eg: For Storage vMotion it eliminates the requirement for Storage vMotion (or the array when offloaded by the VAAI XCOPY primitive) to relocate “White space”
10. Thin provisioning leaves maximum available free space on the physical spindles which should improve performance of the storage as a whole
11. Where there is a real or perceived issue with performance, any VM can be converted to Thick Provisioned non-disruptively using Storage vMotion.
12. Using Thin Provisioned LUNs with no actual over-commitment at the storage layer reduces the risk of out of space conditions while maintaining flexibility and efficiency, with significantly reduced risk and dependency on monitoring.
13. The VAAI UNMAP primitive provides automated space reclamation to reduce wasted space from files or VMs being deleted

Alternatives

1.  Thin Provision the LUN and thick provision virtual machine disks (VMDKs)
2.  Thick provision the LUN and thick provision virtual machine disks (VMDKs)
3.  Thick provision the LUN and thin provision virtual machine disks (VMDKs)

Implications

1. If the storage at the vSphere and array level is not properly monitored, out of space conditions may occur, which will lead to downtime for VMs requiring additional disk space. VMs not requiring additional disk space can continue to operate even where there is no available space on the datastore
2. The storage may need to be monitored in multiple locations increasing BAU effort
3. It is possible for the vSphere layer to report sufficient free space when the underlying physical capacity is close to or entirely used
4. When migrating VMs from one thin provisioned datastore to another (ie: Storage vMotion), the Storage vMotion will utilize additional space on the destination datastore (and underlying storage) while leaving the source thin provisioned datastore inflated even after successful completion of the Storage vMotion.
5. While the VAAI UNMAP primitive provides automated space reclamation, this is a post-process. As such, you still need to maintain sufficient available capacity for VMs to grow prior to UNMAP reclaiming the dead space
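
The monitoring burden described above comes down to tracking two ratios: how full each datastore is at the vSphere layer, and whether the LUNs presented exceed the physical capacity behind them. The following is a minimal Python sketch of that check, not a vendor tool or vSphere API. The Datastore class, the function names and the sample figures are all hypothetical. The only value taken from the text is the 75% alarm threshold.

```python
from dataclasses import dataclass

ALARM_THRESHOLD_PCT = 75  # datastore usage alarm level quoted in justification 7


@dataclass
class Datastore:
    """Hypothetical view of one thin provisioned datastore (LUN)."""
    name: str
    capacity_gb: float     # LUN size as presented to vSphere
    provisioned_gb: float  # total size of the thin VMDKs promised to VMs
    used_gb: float         # space the thin VMDKs currently consume


def usage_pct(ds: Datastore) -> float:
    """How full the datastore is at the vSphere layer."""
    return 100 * ds.used_gb / ds.capacity_gb


def overcommit_pct(ds: Datastore) -> float:
    """Provisioned VMDK space relative to datastore capacity (vSphere layer)."""
    return 100 * ds.provisioned_gb / ds.capacity_gb


def capacity_report(datastores: list, physical_capacity_gb: float) -> dict:
    """Flag datastores at or above the alarm threshold and check whether
    more LUN capacity is presented than physically exists on the array."""
    presented_gb = sum(ds.capacity_gb for ds in datastores)
    return {
        "alarms": [ds.name for ds in datastores
                   if usage_pct(ds) >= ALARM_THRESHOLD_PCT],
        "array_overcommitted": presented_gb > physical_capacity_gb,
    }


# Illustrative figures only.
ds01 = Datastore("ds01", capacity_gb=2048, provisioned_gb=4096, used_gb=1600)
ds02 = Datastore("ds02", capacity_gb=2048, provisioned_gb=3000, used_gb=900)
report = capacity_report([ds01, ds02], physical_capacity_gb=4096)
print(report)
```

In this example ds01 is 200% over-committed at the vSphere layer and has crossed the 75% threshold, while the array itself is not over-committed because the presented LUN capacity equals the physical capacity, matching the optional part of the decision. The sketch ignores array-side effects such as dead space awaiting UNMAP, so real array usage may be higher than the vSphere figures suggest.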

Related Articles

1. Datastore (LUN) and Virtual Disk Provisioning (Thin on Thick)

 

Example Architectural Decision – Single Sign On Configuration for Single Site w/ Multiple vCenter Servers

Problem Statement

What is the most suitable deployment mode for vCenter Single-Sign On (SSO) in an environment where there is a single physical datacenter with multiple vCenter servers?

Requirements

1. The solution must be a fully supported configuration
2. Meet/Exceed RTO of 4 hours
3. Support Single Pane of glass management
4. Ability to scale for future vCenters and/or datacenters

Assumptions

1. All vCenter instances can access the same Authentication source (Active Directory or OpenLDAP)

2. The average number of authentications per second for each SSO instance is <30 (Configuration Maximum)

Constraints

1. vCenter servers reside in different network security zones within the datacenter

Motivation

1. Future proof the environment

Architectural Decision

1. Use “Multi-site” SSO deployment mode

2. Use one SSO instance per vCenter

3. Each SSO instance will reside with the vCenter on a Windows 2008 x64 R2 virtual machine in a vSphere cluster with HA enabled

4. Each SSO instance will use the bundled SQL database

5. (Optional) For greater availability, vCenter Heartbeat can be used to protect each SSO instance along with vCenter and the bundled SSO database

6. The Virtual Machine hosting vCenter/SSO will be 2vCPU and 10GB RAM to support vCenter/SSO/Inventory Service and an additional 2GB RAM to support the bundled SSO Database

7. Using the bundled SSO database ensures only a single vCenter Heartbeat deployment is required to protect each vCenter/SSO instance and reduce Windows licensing

Justification

1. Using one SSO instance per vCenter simplifies the maintenance/upgrade process for vCenter/SSO, as different versions of vCenter cannot co-exist with the same SSO instance

2. If “High Availability” mode is used it would prevent single pane of glass management

3. “High Availability” mode currently requires an SSL load balancer to be configured as well as manual intervention which can be complicated and problematic to implement and support

4. “Basic” mode prevents the use of Linked Mode, which in turn prevents single pane of glass management of the environment

5. Where vCenter servers reside in different network security zones, using Multi-site mode allows each SSO instance to use authentication sources that are as logically close as possible while supporting single pane of glass management. This should provide faster access to authentication services as each SSO instance is configured with Active Directory servers located in the same or logically closest network security zone/s.

6. If one SSO instance goes offline for any reason, it will only impact a single vCenter server. It will not prevent authentication to the other vCenter servers.

7. Reduce the licensing costs for Microsoft Windows 2008 by combining SSO and vCenter roles onto a single OS

Alternatives

1. Use “Basic” Mode, resulting in a standalone version of SSO for each vCenter server with no single pane of glass management

2. Use “High Availability” mode per vCenter

3. Use a shared “High Availability” mode for all vCenters in the datacenter

4. In any SSO configuration, host the SSO database (per vCenter) on an Oracle OR SQL Server

5. Run SSO on a dedicated Windows 2008 instance with or without the SSO database locally

6. Run a single SSO instance in “Multi-Site” mode , use vCenter Heartbeat to protect SSO (including the database) and share the SSO instance with all vCenters

Implications

1. Where SSO is not protected by vCenter Heartbeat (optional), SSO for each vCenter is a Single point of failure where authentication to the affected vCenter will fail

2. “Multi-Site” mode requires the installable version of SSO, which is Windows only. This prevents the use of the vCenter Server Appliance (VCSA), as it only supports Basic mode.

Related Articles

1. vSphere 5.1 Single Sign On (SSO) deployment mode across Active/Active Datacenters

2. vSphere 5.1 Single Sign On (SSO) Architectural Decision Flowchart

3. Disabling Single Sign On – Dont Do It! – Michael Webster (VCDX#66) @vcdxnz001
