Virtualization technologies have proven themselves in the enterprise by delivering cost-effective, scalable and reliable IT computing. High Performance Computing (HPC), however, has not followed suit: HPC workloads have traditionally run on bare-metal, non-virtualized clusters, with physical resources dedicated to them in pursuit of predictable runtimes and maximum performance. Virtualization can provide great benefits here and make HPC infrastructure more efficient, resilient, secure and flexible. In this blog article we look at some of the core challenges in HPC infrastructure and how the VMware platform helps address them.
Core challenges with Bare Metal High Performance Computing environments
No capability for prioritizing workloads for multiple tenants
- Every group or department has its own dedicated HPC environment
- Security requirements across groups prevent the sharing and optimized use of infrastructure
- Isolation of the different HPC environments leads to low utilization and limits compute capability
- No ability to partition and prioritize compute, storage and network resources across different tenants
Lack of Load Balancing
- The static nature of bare metal environments makes them inflexible
- Bare metal HPC clusters are bound by the physical limits of the hardware and are prone to load imbalances
No High Availability for critical components
- Bare metal HPC clusters have no built-in mechanisms to deal with single points of failure
- Hardware maintenance can result in downtime and productivity loss
Workload state reproducibility
- Bare metal HPC environments do not have the capability to capture and reproduce the state of workloads
- There is an unmet need for reproducibility of workloads for analysis and troubleshooting
VMware Platform can address these HPC Challenges
VMware brings a number of innovative technologies to the HPC field that help it reach its full potential. Let’s see how this works. Virtual machines can be tailored to each job, with a specific memory size and CPU count, and specific versions of the OS, application and software libraries, with minimal impact on performance compared to bare metal.
- The isolation between virtual machines yields a number of benefits.
- The amount of resources granted to individual jobs is controllable by policy.
- A fault or crash in an application or OS doesn’t affect any other running jobs.
- And because of security isolation, colocation of workloads from different groups is possible. You can even grant root access in one VM without fear of compromising others.
- By combining servers into a single pool, computational resources, including specialized hardware accelerators such as GPUs and FPGAs, can be shared across all the HPC workloads in your organization.
- If the resources on one server become overloaded, jobs can be automatically migrated live to other servers without disruption, thus maintaining balance of utilization across the environment.
- Should one server experience a hardware failure, jobs are automatically restarted elsewhere in the cluster, enabling greater resiliency and reduced downtime.
- Micro-segmentation capabilities available in VMware NSX provide the ability to securely isolate workloads across different groups, even within the same cluster, enabling secure multi-tenancy.
Virtualized HPC Benefits
The example cluster shown is dedicated to HPC workloads. In a typical enterprise, many isolated HPC environments can be brought together under vSphere and their nodes consolidated into multiple clusters. Multiple applications such as BioInfo, Polybench and RayTracing share the same cluster, as shown.
Within a cluster, resource pools are used to manage resource allocation for workloads by reserving and prioritizing compute resources. Resource pool information for one of the HPC workloads is shown. Compute resource settings reflect different priorities for the workloads. The BioInfo resource pool has high-priority reservations for compute resources. All BioInfo workloads will be located within this resource pool.
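To make the idea of reserving and prioritizing compute concrete, here is a minimal Python sketch of how reservations and shares could interact when a cluster is under contention. It is a toy model, not vSphere's actual scheduler: the `Pool` class, the `allocate` function and all of the MHz and share values are assumptions chosen for illustration. Each pool first receives its reservation, and the remaining capacity is then divided in proportion to shares, never exceeding a pool's demand.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    reservation_mhz: int  # guaranteed minimum
    shares: int           # relative priority under contention (must be > 0)
    demand_mhz: int       # what the workloads currently want


def allocate(pools, capacity_mhz):
    """Toy model: grant reservations, then split the rest by shares, capped at demand."""
    if sum(p.reservation_mhz for p in pools) > capacity_mhz:
        raise ValueError("reservations exceed cluster capacity")
    # Step 1: every pool gets its reservation (or its demand, if that is smaller).
    grant = {p.name: float(min(p.reservation_mhz, p.demand_mhz)) for p in pools}
    remaining = capacity_mhz - sum(grant.values())
    active = [p for p in pools if grant[p.name] < p.demand_mhz]
    # Step 2: hand out leftover capacity in proportion to shares until it is
    # used up or every pool's demand is satisfied.
    while remaining > 1e-9 and active:
        total_shares = sum(p.shares for p in active)
        spent = 0.0
        for p in active:
            offer = remaining * p.shares / total_shares
            take = min(offer, p.demand_mhz - grant[p.name])
            grant[p.name] += take
            spent += take
        remaining -= spent
        active = [p for p in active if p.demand_mhz - grant[p.name] > 1e-9]
    return grant


# Illustrative values only: BioInfo has the largest reservation and share count.
pools = [
    Pool("BioInfo", reservation_mhz=4000, shares=8000, demand_mhz=8000),
    Pool("Polybench", reservation_mhz=1000, shares=4000, demand_mhz=6000),
    Pool("RayTracing", reservation_mhz=0, shares=2000, demand_mhz=3000),
]
for name, mhz in allocate(pools, capacity_mhz=10000).items():
    print(f"{name}: {mhz:.0f} MHz")
```

With these assumed numbers the cluster is oversubscribed, so the high-priority pool ends up with the largest grant while the other pools still make progress.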
vSphere DRS for Cluster Optimization
vSphere DRS ensures that applications can run properly on the same cluster sharing the same physical resources, by load balancing and optimizing the cluster resources.
DRS resolves performance imbalances for the workloads running in the HPC cluster by leveraging vMotion to balance the utilization.
In the example shown below, DRS is enabled in automatic mode with an aggressive threshold to show its impact on balancing the utilization in the cluster. We see under ‘Recent Tasks’ that DRS has initiated movement of virtual machines to balance out the cluster.
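The balancing behavior described above can be illustrated with a small greedy sketch in Python. This is not the DRS algorithm, which weighs many more factors; it is a simplified stand-in that assumes a single load number per virtual machine. The `rebalance` function, the host and VM names, and the threshold value are all hypothetical. A lower threshold plays the role of a more aggressive setting, because smaller imbalances then trigger a migration.

```python
def rebalance(hosts, threshold):
    """Greedy toy balancer.

    hosts: dict mapping host name -> dict of VM name -> load.
    Moves VMs from the busiest host to the idlest host until the gap between
    them is at most `threshold`. Mutates `hosts` and returns the list of
    (vm, source, destination) migrations.
    """
    moves = []
    while True:
        load = {h: sum(vms.values()) for h, vms in hosts.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        if gap <= threshold:
            break
        # Only a VM smaller than the gap narrows the difference between src and dst.
        candidates = {vm: l for vm, l in hosts[src].items() if 0 < l < gap}
        if not candidates:
            break
        # Prefer the VM whose load is closest to half the gap.
        vm = min(candidates, key=lambda v: abs(candidates[v] - gap / 2))
        hosts[dst][vm] = hosts[src].pop(vm)
        moves.append((vm, src, dst))
    return moves


# Hypothetical starting point: one host carries most of the load.
hosts = {
    "esx01": {"bio1": 40, "bio2": 30, "poly1": 20},
    "esx02": {"ray1": 10},
    "esx03": {},
}
for vm, src, dst in rebalance(hosts, threshold=15):
    print(f"migrate {vm}: {src} -> {dst}")
```

In a real cluster each of these moves would be a vMotion live migration; in the sketch it is just a dictionary update.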
Capturing Workload state
There are many instances where an HPC workload needs to be reproduced and the state of a system needs to be captured. vSphere provides snapshot and cloning capabilities that can be used to restart a workload from the point at which it was captured, as and when desired.
The cloning process shown for the poly head virtual machine takes a point-in-time snapshot of the running virtual machine and clones it as a means of capturing the state.
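For readers who prefer scripting to the UI, the snippet below sketches how a snapshot that includes the running state might be requested from Python. It assumes the pyVmomi SDK, where a virtual machine object exposes `CreateSnapshot_Task(name, description, memory, quiesce)`; the article itself only shows the vSphere client, so treat this as one possible approach. The helper name, the snapshot label and the `FakeVM` stand-in are hypothetical, and the stand-in exists only so the sketch runs without a vCenter connection.

```python
def snapshot_running_state(vm, label):
    """Request a snapshot that includes memory, so the VM can resume from this point."""
    return vm.CreateSnapshot_Task(
        name=label,
        description="State captured for reproducibility",
        memory=True,    # include the VM's memory, i.e. the running state
        quiesce=False,  # no guest file system quiescing in this sketch
    )


class FakeVM:
    """Stand-in for a pyVmomi vim.VirtualMachine so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def CreateSnapshot_Task(self, name, description, memory, quiesce):
        self.calls.append({"name": name, "memory": memory, "quiesce": quiesce})
        return "task-1"  # a real call returns a task object to wait on


vm = FakeVM()
task = snapshot_running_state(vm, "polyhead-before-run")
print(task, vm.calls)
```

Against a real environment, `vm` would be a virtual machine object retrieved through an authenticated pyVmomi session, and the returned task would need to be waited on before the snapshot is usable.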
vSphere HA for critical HPC components
vSphere HA can help protect critical HPC virtual machines and improve their availability. The head node for HPC applications is typically a single point of failure. In this example, the head node is running on host sc2esx05 and a failure of that host is simulated.
vSphere HA detects the failure and automatically restarts the affected virtual machines on other nodes in the cluster. The head node is now running on host sc2esx06. vSphere HA helps reduce unplanned downtime for critical HPC virtual machines such as the head node.
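The failover just described can be mimicked with a short Python sketch. It is a simplified model rather than vSphere HA's real placement logic, which also involves admission control and restart priorities. The `fail_over` function, the memory sizes, the VM names and the third host `sc2esx07` are assumptions added for illustration; only `sc2esx05` and `sc2esx06` come from the example above.

```python
def fail_over(placement, capacity, failed_host):
    """Restart the VMs of a failed host on surviving hosts that have room.

    placement: host -> {vm: memory_gb}; capacity: host -> memory_gb.
    Returns (new_placement, unplaced_vms).
    """
    survivors = {h: dict(vms) for h, vms in placement.items() if h != failed_host}
    unplaced = []
    # Place the largest VMs first so they are not squeezed out by small ones.
    for vm, need in sorted(placement[failed_host].items(), key=lambda kv: -kv[1]):
        free = {h: capacity[h] - sum(vms.values()) for h, vms in survivors.items()}
        fits = [h for h in survivors if free[h] >= need]
        if not fits:
            unplaced.append(vm)
            continue
        target = max(fits, key=free.get)  # host with the most free memory
        survivors[target][vm] = need
    return survivors, unplaced


placement = {
    "sc2esx05": {"head-node": 16},
    "sc2esx06": {"compute-1": 32},
    "sc2esx07": {"compute-2": 56},  # hypothetical third host
}
capacity = {"sc2esx05": 64, "sc2esx06": 64, "sc2esx07": 64}

new_placement, unplaced = fail_over(placement, capacity, failed_host="sc2esx05")
print(new_placement)
print("not restarted:", unplaced)
```

With these assumed sizes, the head node can only fit on sc2esx06, which matches the outcome in the example.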
Multi-tenancy for HPC
Micro-segmentation is a network security technique that enables security architects to logically divide the data center into distinct security segments down to the individual workload level, and then define security controls and deliver services for each unique segment.
In our POC environment, we leverage resource pools to set boundaries on compute utilization. The distributed logical router with VMware NSX provides east-west traffic security in the environment and can provide security at a virtual machine level. We can group workloads by application and provide security with NSX. The same resource pools can also be used as boundaries between the three tenants in the environment.
The image shows the use of the NSX distributed firewall to secure traffic between the three different workloads. Multi-tenancy can be flexibly deployed for HPC leveraging VMware NSX.
VMware’s proven, enterprise-class virtualization technologies can be leveraged to increase operational efficiency, reduce complexity and ensure greater workload security for HPC workloads. The use cases demonstrate that the VMware Platform can help mitigate many of the shortcomings of bare metal environments noted above.
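As a rough illustration of how per-workload segmentation separates tenants, the following Python sketch evaluates traffic against an ordered rule list. It is a conceptual model only and does not use the NSX API: the `Rule` class, the `evaluate` function, the VM names, the port number and the default-deny behavior are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    src_group: str
    dst_group: str
    port: int
    action: str  # "allow" or "deny"


def evaluate(rules, groups, src_vm, dst_vm, port):
    """Return the action of the first matching rule; unmatched traffic is denied."""
    for rule in rules:
        if (
            src_vm in groups.get(rule.src_group, set())
            and dst_vm in groups.get(rule.dst_group, set())
            and rule.port == port
        ):
            return rule.action
    return "deny"  # assumed default for this sketch


# Workloads grouped by application, one security group per tenant.
groups = {
    "BioInfo": {"bio-head", "bio-node1"},
    "Polybench": {"poly-head", "poly-node1"},
    "RayTracing": {"ray-head", "ray-node1"},
}
# Each tenant may reach its own VMs over SSH (port 22); nothing crosses tenants.
rules = [Rule(g, g, 22, "allow") for g in groups]

print(evaluate(rules, groups, "bio-head", "bio-node1", 22))
print(evaluate(rules, groups, "bio-head", "poly-head", 22))
```

Traffic inside a tenant matches an allow rule, while traffic between tenants matches nothing and falls through to the assumed default deny.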
Mohan Potheri (VMware) & Michael Cui (VMware)