VMware vSphere 7 with NVIDIA Multi-Instance GPUs (MIG) for Machine Learning Applications

Part 1 of this set of blogs introduces the core concepts of the new Multi-Instance GPUs (MIG) software functionality. We used MIG in technical preview on the NVIDIA A-series GPUs on vSphere 7 in the VMware labs. MIG optimizes the sharing of a physical GPU by a set of VMs on a vSphere 7 host in new ways. It is intended specifically for compute-intensive applications, such as machine learning workloads, and not for graphics workloads. Part 2 goes into the detailed technical steps to set up MIG on vSphere 7. The MIG functionality is provided as part of the NVIDIA vGPU drivers (guest and host), starting with the R450 release with CUDA 11 support.

Multi-Instance GPUs (MIG) is a new feature from NVIDIA that further enhances the vGPU approach to sharing the hardware. It does so by isolating, at the hardware level, each VM's share of the GPU's compute power and memory from the shares of other VMs. MIG isolates the internal hardware paths (such as the L2 cache, the memory controllers, the address bus, and the control and data cross-bars) that lead to any one vGPU's share of the physical GPU's memory and cores. This gives each vGPU a more predictable level of performance and allows administrators to pack more workloads onto the GPU itself, improving utilization. This makes MIG well suited to providing GPU power as a service, whether by a cloud provider or by an internal IT department. The architecture for MIG is shown below. You can read more about the new MIG feature concepts in part 1 and delve into the technical setup steps on vSphere in part 2.

The MIG Architecture

Source: The NVIDIA MIG User Guide
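
To make the idea of hardware-level partitioning more concrete, here is a minimal Python sketch of a capacity check for a set of MIG instances on one physical GPU. It is an illustration under stated assumptions, not part of the vSphere or NVIDIA tooling. The profile names and sizes are assumed from NVIDIA's public documentation for the A100 40 GB GPU (7 compute slices, 40 GB of memory); this overview does not list them. The names MigProfile, PROFILES and fits_on_gpu are hypothetical. The check only sums compute and memory, and ignores the placement rules that the real driver also enforces.

```python
from dataclasses import dataclass

# Assumption: totals for an A100 40 GB GPU, taken from NVIDIA's public
# documentation rather than from this article.
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_GB = 40


@dataclass(frozen=True)
class MigProfile:
    """Hypothetical record for one MIG instance size."""
    name: str
    compute_slices: int
    memory_gb: int


# Assumption: A100 40 GB profile names and sizes.
PROFILES = {
    "1g.5gb": MigProfile("1g.5gb", 1, 5),
    "2g.10gb": MigProfile("2g.10gb", 2, 10),
    "3g.20gb": MigProfile("3g.20gb", 3, 20),
    "4g.20gb": MigProfile("4g.20gb", 4, 20),
    "7g.40gb": MigProfile("7g.40gb", 7, 40),
}


def fits_on_gpu(requested):
    """Return True if the requested profiles fit within one GPU's totals.

    Simplified: sums compute slices and memory only. The real driver also
    applies placement constraints that this sketch does not model.
    """
    compute = sum(PROFILES[name].compute_slices for name in requested)
    memory = sum(PROFILES[name].memory_gb for name in requested)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_GB


if __name__ == "__main__":
    # Four VMs of different sizes sharing one physical GPU.
    print(fits_on_gpu(["3g.20gb", "2g.10gb", "1g.5gb", "1g.5gb"]))
    # Two 4-slice instances would need 8 compute slices.
    print(fits_on_gpu(["4g.20gb", "4g.20gb"]))
```

In this simplified model, each VM is assigned one profile, and the sum of the assigned slices cannot exceed what the physical GPU offers. That fixed, non-overlapping share is what gives each vGPU its predictable performance.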