High Performance Computing (HPC) is the use of parallel-processing techniques to solve complex computational problems. HPC systems have the ability to deliver sustained performance through the concurrent use of distributed computing resources, and they are typically used for solving advanced scientific and engineering problems, such as computational fluid dynamics, bioinformatics, molecular dynamics, weather modeling and deep learning with neural networks.
With continued efforts to improve performance to levels near those of bare-metal HPC environments, the trend toward virtualized HPC (vHPC) is growing rapidly. This is particularly true for enterprise-grade workloads such as machine learning and deep learning.
Because of their extreme performance demands, HPC workloads often have much more intensive resource requirements than typical enterprise workloads. For example, HPC commonly leverages hardware accelerators, such as GPUs and FPGAs for compute, as well as RDMA interconnects for fast communication, all of which require special vSphere configurations.
This toolkit is intended to facilitate managing the lifecycle of these special configurations by leveraging vSphere APIs. It also includes features that help vSphere administrators perform some common vSphere tasks that are related to creating such high-performing environments, such as VM cloning, setting Latency Sensitivity, and sizing vCPUs, memory, etc.
This toolkit is one of VMware's open source projects, released under the Apache 2 license. It can be obtained from the VMware GitHub repository vmware/vhpc-toolkit. The toolkit is also available through the VMware OCTO Flings program as Virtualized High Performance Computing Toolkit.
This toolkit currently supports vSphere 6.5 and 6.7 and requires Python 3. Follow the instructions in the README to pip install the required packages and set the vCenter IP/FQDN in a vCenter.conf file. Once vCenter.conf is set up properly, you can execute vhpc_toolkit under the bin folder to enter the interactive shell and perform all available operations. For example:
./vhpc_toolkit
vCenter password: (enter your vCenter password)
Welcome to the vHPC Toolkit Shell. Type help or ? to list commands. Type exit to exit the shell.

vhpc_toolkit> help

Documented commands (type help <topic>):
========================================
clone    cpumem   dvs   help     network      passthru  power   sriov  vgpu
cluster  destroy  exit  latency  network_cfg  post      pvrdma  svs    view
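For reference, vCenter.conf is just a small text file that points the toolkit at your vCenter. A minimal sketch is shown below; the key names here are illustrative assumptions, so follow the README for the exact format:

# vCenter.conf -- illustrative sketch only; see the project README for the exact key names
server: vcenter.hpc.vmware.com          # vCenter IP or FQDN (assumed key name)
username: administrator@vsphere.local   # assumed key name; the password is prompted at runtime, as shown above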
There are two categories of functions in this toolkit: (1) configuration of vHPC environments; (2) vHPC cluster creation and destruction using a configuration file.
Configuration of vHPC Environments
Using this toolkit, we can easily apply the following operations to a single VM or a list of VMs:
- Perform common vSphere tasks, such as cloning VMs, configuring vCPUs, memory, reservations, shares, Latency Sensitivity, Distributed Virtual Switch/Standard Virtual Switch, network adapters and network configurations
- Configure PCIe devices in DirectPath I/O mode, such as GPU, FPGA and RDMA interconnects
- Configure NVIDIA vGPU
- Configure RDMA SR-IOV (Single Root I/O Virtualization)
- Configure PVRDMA (Paravirtualized RDMA)
The following illustrates the usage of some of these commands.
Clone and Customize VM
Clone multiple VMs based on a template named “vhpc_clone” with specified CPU and memory customization:
vhpc_toolkit> clone --template vhpc_clone --datacenter HPC_Datacenter --cluster COMPUTE_GPU_Cluster --datastore COMPUTE01_vsanDatastore --memory 8 --cpu 8 --file VM-file
where VM-file is the name of a file containing the list of cloned VM names, one per line.
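For example, a VM-file describing four clones could simply contain (hypothetical VM names):

vhpc-vm1
vhpc-vm2
vhpc-vm3
vhpc-vm4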
Configure GPU DirectPath I/O (Passthrough)
Add GPU device 0000:84:00.0 in Passthrough mode into each above cloned VM:
vhpc_toolkit> passthru --add --device 0000:84:00.0 --file VM-file
where “0000:84:00.0” is the SBDF address (segment:bus:device.function) of the GPU device. This value can be found under “Host” -> “Configure” -> “Hardware” -> “PCI Devices” in vCenter.
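Alternatively, the SBDF addresses of installed devices can be listed from the ESXi shell; a quick sketch for NVIDIA GPUs (the exact output format varies by ESXi version):

# On the ESXi host, list PCI devices and filter for NVIDIA GPUs
lspci | grep -i nvidia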
Configure NVIDIA vGPU
Alternatively, add an NVIDIA vGPU with the vGPU profile grid_p100-4q (NVIDIA P100) to each cloned VM:
vhpc_toolkit> vgpu --add --profile grid_p100-4q --file VM-file
where the profile represents the vGPU type and “4q” indicates a 4 GB frame buffer per vGPU.
Configure CPU/Memory Reservation and Latency Sensitivity
vhpc_toolkit> cpumem --cpu_reservation yes --file VM-file
vhpc_toolkit> cpumem --memory_reservation yes --file VM-file
vhpc_toolkit> latency --level high --file VM-file
The above three commands reserve CPU and memory and set “Latency Sensitivity” to “High” for each VM in the VM-file. The High latency sensitivity setting requires full memory reservation and works best with full CPU reservation as well, which is why the three commands are applied together.
Execute Post Scripts in Guest OS
vhpc_toolkit> post --script ../examples/post-scripts/install_cuda.sh --guest_username vmware --file VM-file
It will prompt you for the guest OS password before executing the installation script. This function helps facilitate guest OS customization after provisioning VMs.
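Any executable script can serve as a post script. The sketch below is a hypothetical example of the pattern, not the repository's install_cuda.sh:

#!/bin/bash
# Hypothetical post script: verify the GPU is visible inside the guest,
# then run distribution-specific driver/CUDA installation steps.
lspci | grep -i nvidia || { echo "No NVIDIA GPU visible in guest"; exit 1; }
# ... driver and CUDA toolkit installation commands for your distribution ...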
For more examples, please refer to sample operations in the project docs.
vHPC Cluster Creation and Destruction using a Configuration File
This function can help vSphere administrators create/destroy virtual HPC clusters using a cluster configuration file as input.
For example, create a cluster based on the cluster configuration file “cluster.conf”:
vhpc_toolkit> cluster --create --file cluster.conf
Similarly, destroy the cluster:
vhpc_toolkit> cluster --destroy --file cluster.conf
The cluster configuration file allows you to easily define an HPC/ML cluster whose VMs have all kinds of special attributes. Here is a sample cluster configuration file:
# =========================COMMENT==============================
# Virtualized High Performance Computing Toolkit
# Example of creating four VMs
# Each VM has an NVIDIA P100 vGPU configured in shared passthrough (vGPU) mode
# Each VM has 5 vCPUs and 16 GB memory
# Modify the file according to your platform and needs
# =========================DEFINITION STARTS=====================

[BASE]
template: vhpc_clone
cpu: 5
memory: 16
datacenter: HPC_Datacenter
cluster: COMPUTE_GPU_Cluster
host: vhpc-esx-05.hpc.vmware.com
datastore: COMPUTE01_vsanDatastore
linked: 0

[NETWORK]
is_dhcp: true
port_group: vHPC-PG-VMNetwork
domain: hpc.vmware.com
netmask: 255.255.255.0
gateway: 172.1.101.1
dns: ['172.1.110.1', '172.1.110.2']

[VGPU]
vgpu: grid_p100-4q

[LATENCY]
latency: high
cpu_reservation: true
memory_reservation: true

[_VMS_]
vgpu-vm{1:4}: BASE NETWORK LATENCY VGPU

# =========================DEFINITION ENDS========================
You can define different virtual clusters with a variety of configurations, including VMs with GPU Passthrough, InfiniBand Passthrough/SR-IOV, and RoCE (RDMA over Converged Ethernet) Passthrough/SR-IOV/PVRDMA. For details on the syntax, and for more sample files defining different virtualized HPC/ML clusters, see the README and the sample cluster configuration files in the project.
The toolkit is also built with extensibility in mind. It is easy to add additional operations that are currently not supported.
Feel free to try out the tool and, as always, we strongly encourage you to report bugs and suggest improvements. We also welcome contributions to the tool from the community!