Example Architectural Decision – Virtual Switch Load Balancing Policy

Problem Statement

What is the most suitable network adapter load balancing policy for the vSwitch and dvSwitch(es) in an environment where 10Gb adapters are used for the dvSwitches and 1Gb adapters for the vSwitch, which carries only ESXi management traffic?

Assumptions

1. vSphere 4.1 or later

Motivation

1. Ensure optimal performance and redundancy for the network
2. Simplify the solution without compromising performance for functionality

Architectural Decision

Use “Route based on physical NIC load” for distributed virtual switches and “Route based on originating port ID” for standard vSwitches.

Justification

1. Route based on physical NIC load achieves both availability and performance
2. Requires only a basic switch configuration (802.1q and the required VLANs tagged)
3. Where a single pNIC’s utilization exceeds 75%, “Route based on physical NIC load” will dynamically rebalance workloads across the uplinks to ensure the best possible performance
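
As a rough sketch, the standard-vSwitch half of this decision can be applied from the ESXi shell. The vSwitch name below is an example only; note that “Route based on physical NIC load” cannot be set with esxcli and is configured on the dvSwitch via the vSphere Client (or PowerCLI) instead:

```shell
# Set "Route based on originating port ID" on the management vSwitch
# (vSwitch0 is an example name).
esxcli network vswitch standard policy failover set \
    --vswitch-name=vSwitch0 \
    --load-balancing=portid

# Confirm the active teaming policy
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
```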

Implications

1. If NFS IP storage is used with a single VMkernel port, it will not use both connections concurrently. If using multiple 10Gb connections for NFS traffic is required, two or more VLANs should be created, with one VMkernel port per VLAN. If only one VMkernel port is used, the only way to send traffic down multiple uplinks is to use “Route based on IP hash” and configure EtherChannel on the physical switch.
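
The multi-VMkernel layout above could be sketched as follows from the ESXi shell (the port group names, VLAN IDs and IP addresses are examples only, and the commands are shown against a standard vSwitch for brevity; in this design the equivalent port groups would live on the dvSwitch and be created via the vSphere Client or PowerCLI):

```shell
# First NFS port group and VMkernel port on VLAN 100
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=NFS-A
esxcli network vswitch standard portgroup set --portgroup-name=NFS-A --vlan-id=100
esxcli network ip interface add --interface-name=vmk2 --portgroup-name=NFS-A
esxcli network ip interface ipv4 set --interface-name=vmk2 --type=static \
    --ipv4=192.168.100.11 --netmask=255.255.255.0

# Second NFS port group and VMkernel port on VLAN 101
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=NFS-B
esxcli network vswitch standard portgroup set --portgroup-name=NFS-B --vlan-id=101
esxcli network ip interface add --interface-name=vmk3 --portgroup-name=NFS-B
esxcli network ip interface ipv4 set --interface-name=vmk3 --type=static \
    --ipv4=192.168.101.11 --netmask=255.255.255.0
```

With one VMkernel port per VLAN, LBT can place each on a different uplink, so NFS traffic can use both 10Gb links without IP hash or EtherChannel.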

Alternatives

1. Route based on the originating port ID

Pros: Chooses an uplink based on the virtual port where the traffic entered the virtual switch. The virtual machine outbound traffic is mapped to a specific physical NIC based on the ID of the virtual port to which this virtual machine is connected. This method is simple and fast, and does not require the VMkernel to examine the frame for necessary information.

Cons: When load is distributed in the NIC team using the port-based method, a virtual machine with a single virtual NIC will never get more bandwidth than a single physical adapter can provide.

2. Route based on IP hash.

Pros: Chooses an uplink based on a hash of the source and destination IP addresses of each packet. For non-IP packets, whatever is at those offsets is used to compute the hash. In this method, a NIC for each outbound packet is chosen based on its source and destination IP address. This method has a better distribution of traffic across physical NICs.

When load is distributed in the NIC team using the IP-based method, a virtual machine with a single virtual NIC might use the bandwidth of multiple physical adapters.

Cons: This method has higher CPU overhead and is not compatible with all switches (it requires IEEE 802.3ad link aggregation support).
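
As a simplified illustration of why IP hash can use multiple uplinks for a single VM (this is not VMware’s exact algorithm, just the general idea): a value derived from the source and destination IP addresses is taken modulo the number of uplinks, so different destination addresses can map to different pNICs.

```shell
# Toy IP-hash: XOR the last octets of the source and destination IPs,
# then take the result modulo the number of uplinks (2 here).
src_octet=10   # e.g. a VM at 192.168.1.10
for dst_octet in 1 2 3 4; do
  echo "destination .$dst_octet -> uplink $(( (src_octet ^ dst_octet) % 2 ))"
done
```

One source IP talking to four destinations alternates between the two uplinks, whereas the port-ID or MAC-hash policies would pin all four flows to the same uplink.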

3. Route based on source MAC hash

Pros: Chooses an uplink based on a hash of the source Ethernet MAC address. This method is compatible with all physical switches. The virtual machine outbound traffic is mapped to a specific physical NIC based on the virtual NIC’s MAC address.

Cons: Although this method has low overhead, it might not spread traffic evenly across the physical NICs.

When load is distributed in the NIC team using the MAC-based method, a virtual machine with a single virtual NIC will never get more bandwidth than a single physical adapter can provide.

4. Use explicit fail-over order

Pros: Always uses the highest order uplink from the list of Active adapters which passes failover detection criteria.

Cons: This setting is effectively a failover policy rather than a true load balancing policy.

5. Route based on Physical NIC load

Pros: The most efficient load balancing mechanism, because it is based on the actual physical NIC workload.

Cons: Not available on standard vSwitches.

For further information on the topic, check out the below two articles by a couple of very knowledgeable VCDXs:

Michael Webster – Etherchanneling or Load based teaming?
Frank Denneman – IP Hash versus LBT

12 thoughts on “Example Architectural Decision – Virtual Switch Load Balancing Policy”

  1. Great analysis. We are using dual 10Gb NICs for both VM port traffic and NFS IP storage, and we use Route based on originating port ID, based on Intel’s best practices from nearly three years ago. This configuration has worked well in our environment. I’d like to use the route based on physical load option, but it did not seem to work as well in our environment, and I wanted to stay away from IP hash just to keep things simple. Sharing the ports and NICs with NFS IP traffic also hinders moving to route based on physical load.

    • Hi Tom,

      Check out NetApp’s Technical Report TR-3749, pages 37 through 43. This solution allows LBT to be used and ensures your NFS traffic uses both 10Gb links.

      However, if you’re not having issues with the current setup, I wouldn’t recommend changing just for the sake of it.

      Hope that helps.

      • Yes, and unfortunately that is not standards-based.
        If the physical switches do support some vendor-specific multi-switch EtherChannel setup, then IP hash could very well be the best NIC teaming policy, but at the price of increased complexity and being bound to a vendor-proprietary solution.

        • Agreed, LBT is generally the easiest and best option.
          Running NFS traffic over two VMkernel ports rather than using IP hash is a much simpler solution, and would still give excellent performance over 10Gb connections without the added complexity. Thanks for the comment.

  2. Wouldn’t there be a problem if you have IP hash on NFS and virtual port ID / LBT on other port groups going out the same uplinks?
    They are mutually exclusive as far as I know.

    I also like the LBT route, even if it multiplies the number of VLANs/port groups needed for NFS versus IP hash.

  3. Pingback: [Virtualization in Practice] Network Design Part 4: Teaming | arc5 - riaos.com

  4. Quick question Josh: if the VLAN ID is set to 4095 on a port group on a switch (vSS or vDS), does Promiscuous Mode need to be set to “Accept” for VMs on that port group to be able to receive tagged traffic? I have been under the impression that if the VLAN ID is set to 4095, VMs will sniff all traffic regardless of the Promiscuous Mode setting. Am I correct?

    I understand that the trunking driver needs to be installed on those VMs and the physical switch port needs to be a trunk port.