Enterprise HPC infrastructure teams that support engineers have a challenging task keeping up with the latest advances in hardware and capabilities. Until recently, engineers relied on their workstations and bare metal compute clusters for their HPC simulations. The downside is that a workstation is not suitable for running very large, compute-intensive simulation jobs, and bare metal clusters are not the best solution for an organization. Legacy HPC environments are usually independent silos, one per organization, which results in underutilization of resources. In the article VMware Solutions Enhance HPC, we showed the tremendous benefits of consolidating and virtualizing HPC infrastructure compared to bare metal environments.
Cloud-like Capabilities for HPC Environments:
HPC cloud solution providers like UberCloud can combine powerful virtualized HPC hardware with superior management capabilities to help IT dynamically provision and manage HPC clusters, deploy simulation software applications, monitor spend, and retain complete control. In this solution we combine the capabilities of the VMware platform with the solution provided by HPC cloud provider UberCloud. The UberCloud solution leverages the automation capabilities of Terraform, together with the packaging and distribution capabilities of Docker containers, to dynamically deploy HPC applications on the vSphere platform.
Containers complement VMs in application management and distribution. Because applications are sandboxed in containers, they become portable, and the same application can be deployed to different VMs without creating multiple variations of VM images for every application. This speeds up development and testing and significantly accelerates the change management process. Containers that fail can be removed and replaced without noticeable impact on service.
The core capabilities that are enabled by the solution include:
- Provisioning and managing HPC clusters that are easily set up for users in minutes.
- Rapidly deploying ready-to-run instances with pre-installed HPC software, eliminating the need for complex software installation and configuration.
- Optimized use of resources shared across multiple groups of HPC users with scale up and scale down capabilities.
UberCloud on vSphere Solution:
In this solution, we deployed two different use cases:
- Single node deployment with all components included
- Distributed deployment with a head node, a GUI node and three worker nodes for a distributed workload
Single Node Deployment:
This is a simple deployment of UberCloud on vSphere. A single VM is deployed, and all components are downloaded and installed dynamically. The single node includes the head node, GUI node, and worker node components.
The vCenter access details and the components of the deployment are defined in the Terraform file. The single virtual machine, with the requisite cores and memory, is automatically created in vCenter at the specified location from the UberCloud template VM image. Following deployment, the requisite Docker images are downloaded and installed automatically.
The Terraform script is launched and is used to create the VM, download and install Docker, configure the application, and verify its readiness for use, as shown below.
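As a rough illustration of what such a definition might contain, the sketch below uses Python to build a minimal configuration in Terraform's JSON syntax. It is not the UberCloud code. The vCenter address, credentials, VM name, sizing, and the placeholder IDs are all hypothetical, and a real `vsphere_virtual_machine` resource needs further arguments (network interface, disk, and lookups for the IDs) that are omitted here.

```python
import json


def build_single_node_config(vcenter, user, password, vm_name, cpus, memory_mb,
                             template_uuid, resource_pool_id, datastore_id):
    """Return a Terraform JSON-syntax configuration as a Python dict.

    Illustrative only: a deployable configuration needs more arguments
    (network interface, disk, and so on) than are shown here.
    """
    return {
        "provider": {
            "vsphere": {
                "vsphere_server": vcenter,
                "user": user,
                "password": password,
            }
        },
        "resource": {
            "vsphere_virtual_machine": {
                vm_name: {
                    "name": vm_name,
                    "num_cpus": cpus,        # requisite cores
                    "memory": memory_mb,     # requisite memory, in MB
                    "resource_pool_id": resource_pool_id,
                    "datastore_id": datastore_id,
                    # Clone the VM from an existing template image.
                    "clone": {"template_uuid": template_uuid},
                }
            }
        },
    }


# All values below are made-up placeholders, not values from the article.
config = build_single_node_config(
    vcenter="vcenter.example.com",
    user="hpc-admin@vsphere.local",
    password="CHANGE_ME",
    vm_name="hpc_single_node",
    cpus=16,
    memory_mb=65536,
    template_uuid="TEMPLATE-UUID-PLACEHOLDER",
    resource_pool_id="RESOURCE-POOL-ID-PLACEHOLDER",
    datastore_id="DATASTORE-ID-PLACEHOLDER",
)

# The result could be written to a file such as main.tf.json.
print(json.dumps(config, indent=2))
```

Keeping the sizing and location as function parameters mirrors the idea in the text: one definition, reused for differently sized VMs.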
Figure 1: Virtual machine deployment and configuration
Figure 2: Docker components and containers with HPC applications are installed
Figure 3: Virtual machine representing the single node deployment
Once all the components are deployed successfully, the end user is sent an email with a link to log in to the environment, along with a uniquely generated password, as shown below.
Figure 4: UberCloud email with remote access details
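How UberCloud generates the per-deployment password is not described here. As a minimal sketch of one common approach, the following uses Python's standard `secrets` module; the length and the alphabet are assumptions chosen for illustration.

```python
import secrets
import string

# Assumed alphabet: letters and digits only, to keep the password easy to paste.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length=16):
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


password = generate_password()
```

A fresh call for every deployment gives each environment its own credential, so a leaked password affects only one environment.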
The link and the password can be used to log in to the environment with a browser. The user interface after login is shown below.
Figure 5: ANSYS application running on the vSphere based private cloud
HPC uses distributed computing pervasively to solve many of the most difficult problems. Ansys provides the Distributed Solve Option (DSO), a productivity enhancement tool that accelerates sweeps of design variations by distributing the design parameters across a network of processors. As a second phase of the solution, a distributed deployment was built with UberCloud and ANSYS, again leveraging Terraform automation.
Distributed computing addresses many high-level business challenges:
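The idea of spreading design variations over a pool of processors can be sketched as follows. This is not how DSO is implemented; it is a generic parameter-sweep illustration in which the parameter names, the round-robin split, and the stand-in `solve` function are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def design_points(parameters):
    """Expand {name: [values]} into one dict per design variation."""
    names = sorted(parameters)
    combos = product(*(parameters[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def partition(points, n_workers):
    """Split the design points round-robin into one chunk per worker."""
    return [points[i::n_workers] for i in range(n_workers)]


def solve(point):
    """Stand-in for a real solver run (hypothetical formula)."""
    return point["length_mm"] * point["width_mm"]


def run_sweep(parameters, n_workers):
    """Evaluate every design point, one chunk per worker, and collect results."""
    chunks = partition(design_points(parameters), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunk_results = list(
            pool.map(lambda chunk: [(p, solve(p)) for p in chunk], chunks)
        )
    # Flatten the per-worker lists into a single list of (point, result) pairs.
    return [pair for chunk in chunk_results for pair in chunk]


sweep = {"length_mm": [10, 20, 30], "width_mm": [1, 2]}
results = run_sweep(sweep, n_workers=3)
```

Because each design point is independent of the others, the sweep needs no communication between workers, which is why this kind of workload distributes so readily.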
- Enterprises can more effectively and robustly utilize their compute resources to optimize their designs.
- Design iterations are faster, delivering new products to market in a fraction of the time.
- Design evaluation is approaching true scalability.
- Businesses can use this flexible heterogeneous computing infrastructure in private and public cloud environments to solve engineering challenges locally or tie into any compute resources worldwide.
Using similar Terraform-based automation, a distributed deployment of ANSYS HPC components was accomplished. The virtual machines deployed for the distributed solution are shown below. This deployment uses a GUI node, a head node, and three compute nodes. The GUI node was allocated an NVIDIA vGPU for good graphical performance, leveraging vSphere support for accelerators.
Figure 6: Virtual components of distributed HPC workload
The solution was deployed and a distributed test script was executed on the HPC cluster leveraging ANSYS FLUENT. The script shown below leverages MPI over TCP and uses 12 processors.
Figure 7: Script for test use case
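To make the process layout concrete, the sketch below shows one way 12 processes could be spread over three worker nodes. The even split and the host names are assumptions for illustration; the actual placement is determined by the test script and the MPI launcher, and the host list format those tools expect is not covered here.

```python
def allocate_processes(hosts, total_processes):
    """Spread processes over hosts as evenly as possible.

    Any remainder goes to the first hosts in the list.
    """
    base, extra = divmod(total_processes, len(hosts))
    return {
        host: base + (1 if index < extra else 0)
        for index, host in enumerate(hosts)
    }


# Hypothetical host names for the three worker nodes.
workers = ["worker1", "worker2", "worker3"]
allocation = allocate_processes(workers, 12)

for host, count in allocation.items():
    print(host, count)
```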
The image below shows execution of the script with the three worker nodes.
Figure 8: Execution of distributed workload
The computation proceeds to convergence over multiple steps, as shown below.
Figure 9: Convergence of the distributed computation
vSphere is an excellent platform for high performance computing. This UberCloud-based HPC solution on vSphere was deployed with Terraform and successfully demonstrated. All the applications were containerized and hosted within vSphere. The solution showcased the ability to fully package an HPC application, with automated deployment and teardown in a matter of minutes. HPC users can be highly productive and get an environment ready on demand, rather than waiting the many weeks or months that bare metal environments can require. The solution can scale to tens of nodes and can also be leveraged for distributed computing.
The full paper with the Terraform code listing is available at this location.
Special thanks to t