High Performance Computing (HPC) workloads are increasingly run on VMware vSphere. HPC throughput workloads typically scale well in vSphere environments. The goal of this proof of concept is to validate scalability and computational performance for HPC throughput workloads on VMware Cloud on AWS.
Proof of Concept for HPC Throughput workloads with VMware Cloud on AWS
The goal of the proof of concept was to validate that the VMware Cloud on AWS infrastructure is viable and performant for HPC throughput workloads. Testing was done with throughput workloads to ensure that they were functional and ran well in the environment. VMware Cloud on AWS is powered by VMware Cloud Foundation, the unified SDDC platform that provides integrated cloud infrastructure for a hybrid cloud. With this new public cloud infrastructure available, we ran a common set of HPC throughput benchmarks to evaluate its compute performance and scalability.
Quick Background on VMware Cloud on AWS
Figure 1: VMware Cloud on AWS Overview
VMware Cloud on AWS is an on-demand service that enables you to run applications across vSphere-based cloud environments with access to a broad range of AWS services. Powered by VMware Cloud Foundation, this service integrates vSphere, vSAN and NSX along with VMware vCenter management, and is optimized to run on dedicated, elastic, bare-metal AWS infrastructure.
HPC is defined as the use of multiple computers and parallel-processing techniques to solve complex computational problems. HPC focuses on developing parallel-processing algorithms and systems by incorporating both throughput-based and parallel-computational techniques. HPC is typically used for solving advanced technical problems and performing research activities through computer modeling, simulation and analysis. HPC systems have the ability to deliver sustained performance through the concurrent use of computing resources.
HPC has evolved to meet increasing demands for processing capabilities. It brings together several technologies, such as computer architecture, algorithms, programs, electronics and system software, under a single canopy to solve advanced problems effectively and quickly. HPC has broad adoption in multidisciplinary areas including:
- Energy, oil and gas
- Electronic design-automation
- Environment and weather forecasting
- Financial services
- Geographical data
- Media and entertainment
- Machine learning/deep learning
One common type of HPC workload is “throughput”. In throughput workloads, multiple individual jobs run simultaneously to complete a given task, each running independently with no communication between jobs. Typical throughput workloads include Monte Carlo simulations in financial risk-analysis, digital movie-rendering, electronic design-automation and genomics analysis, where each program either runs for a long time or is executed hundreds, thousands or even millions of times with varying inputs.
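To make the "independent jobs, no communication" pattern concrete, here is a minimal sketch of a throughput-style run in Python. It is an illustration only and was not part of the PoC: the function names, the toy Monte Carlo estimate of pi, the sample count and the worker count are all assumptions chosen for the example.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def estimate_pi(seed: int, samples: int = 100_000) -> float:
    """One independent job: estimate pi from random points in the unit square."""
    rng = random.Random(seed)  # per-job generator, so jobs share no state
    inside = sum(
        1
        for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples


def run_jobs(seeds, workers: int = 4):
    """Run every job independently; results are combined only at the end."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(estimate_pi, seeds))


if __name__ == "__main__":
    # Eight jobs that differ only in their input (the seed).
    estimates = run_jobs(range(8))
    print(sum(estimates) / len(estimates))
```

Because no job waits on another, adding workers (or, in a cluster, compute nodes) simply lets more jobs run at once, which is why this class of workload tends to scale well.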
HPC Throughput Proof of concept environment on VMware Cloud on AWS
We tested the performance of a mixture of HPC throughput workloads represented by the BioPerf test suite. BioPerf is a benchmark suite that represents the typical characteristics of bioinformatics applications and is a good proxy for throughput-based HPC applications. We used 7 of the 10 bioinformatics packages to validate the infrastructure. These seven workloads were chosen because they were the most CPU intensive and can push server utilization to 90-100% with their workload mixes.
As shown in figure 2, the virtual machines in these tests were sized to maximize utilization of the VMware Cloud on AWS cluster, which consists of four ESXi hosts. One ESXi host ran the master VM, which served as the Torque resource manager and was sized to consume all resources of that host. The other three ESXi hosts ran six compute VMs, which served as Torque compute nodes. The compute VMs were sized one per socket to benefit from NUMA locality and to consume the full compute capability of the cluster. CentOS 7 64-bit Linux was used for all compute and master VMs.
Figure 2: Cluster configuration of HPC throughput testing on VMware Cloud on AWS
The cluster configuration is shown in figure 2. Virtual machines hpc-01 through hpc-06 are the compute VMs (Torque compute nodes) and the VM “torquehead” is the master VM (Torque resource manager).
Compute VM Profile
As shown in figure 3, each compute VM has 18 vCPUs, matching the 18 physical cores in one socket of the ESXi host. The six compute VMs consume three of the four ESXi hosts in the cluster.
Figure 3: The configuration of the compute VM
Master VM Profile
As shown in figure 4, the typical sizing for the master VM involves dedicating a full host of compute resources. Therefore, this Torque manager VM is sized to use all 36 cores on the ESXi host.
Figure 4: The configuration of the master VM
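The sizing arithmetic behind the two VM profiles can be sketched as follows. The class and function names are hypothetical, and the value of two sockets per host is inferred from the figures quoted above (36 cores per host, 18 cores per socket) rather than stated directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    sockets: int = 2            # inferred: 36 cores per host / 18 per socket
    cores_per_socket: int = 18  # from the compute VM profile

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket


def plan_cluster(hosts: int = 4, master_hosts: int = 1,
                 host: HostSpec = HostSpec()) -> dict:
    """Derive VM counts and vCPU totals for one-VM-per-socket sizing."""
    compute_hosts = hosts - master_hosts
    compute_vms = compute_hosts * host.sockets  # one compute VM per socket
    return {
        "compute_vms": compute_vms,
        "vcpus_per_compute_vm": host.cores_per_socket,
        "total_compute_vcpus": compute_vms * host.cores_per_socket,
        "master_vcpus": master_hosts * host.cores,
    }


print(plan_cluster())
# {'compute_vms': 6, 'vcpus_per_compute_vm': 18,
#  'total_compute_vcpus': 108, 'master_vcpus': 36}
```

Keeping each compute VM within a single socket means its vCPUs and memory can stay on one NUMA node, which is the reason given above for the one-per-socket layout.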
AWS EFS Configuration for shared NFS filesystem used in throughput testing
Throughput workloads need a shared filesystem for sharing executables, inputs and outputs of applications across the compute and master VMs. In bare-metal Linux environments this need is satisfied by NFS (Network File System). Since VMware Cloud on AWS runs in AWS, it can natively access AWS services, including AWS EFS, through the high-speed internal network backbone. AWS EFS provides NFS-compatible services for workloads running on AWS. In this PoC, we leveraged this capability by combining vSphere VMs with the AWS EFS service. EFS was configured for use with VMware Cloud on AWS per the instructions provided in this blog. The EFS file system was created using the AWS console and set up for maximum I/O performance. The appropriate security groups and rules were established so that the VMware Cloud on AWS cluster could communicate with the EFS service via the NFS protocol.
Figure 5: Leveraging AWS EFS in VMware Cloud on AWS for application storage
NFS Guest Configuration
Using the IP address of the EFS service for the appropriate availability zone, the compute and master VMs were configured to mount the EFS file system via NFS.
Figure 6: EFS Mount details showing df output and contents of /etc/fstab
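For readers without access to the screenshot, the sketch below builds the kind of /etc/fstab line such a mount typically uses. It is not the exact entry from the PoC: the IP address (a documentation-range placeholder), the mount point /mnt/efs and the function name are assumptions, and the NFS options are the ones AWS generally recommends for EFS, not values confirmed by the source.

```python
import ipaddress

# Mount options commonly recommended by AWS for EFS over NFSv4.1
# (assumed here; the PoC's actual options are only visible in figure 6).
EFS_NFS_OPTIONS = (
    "nfsvers=4.1",
    "rsize=1048576",
    "wsize=1048576",
    "hard",
    "timeo=600",
    "retrans=2",
    "noresvport",
    "_netdev",  # wait for networking before mounting at boot
)


def efs_fstab_entry(efs_ip: str, mount_point: str = "/mnt/efs",
                    options=EFS_NFS_OPTIONS) -> str:
    """Return one /etc/fstab line that mounts the EFS export by IP address."""
    ipaddress.ip_address(efs_ip)  # raises ValueError on a malformed address
    return f"{efs_ip}:/ {mount_point} nfs4 {','.join(options)} 0 0"


# 192.0.2.10 is a placeholder, not the PoC's mount target address.
print(efs_fstab_entry("192.0.2.10"))
```

The same line would be added on every compute VM and on the master VM so that all Torque nodes see identical paths for executables, inputs and outputs.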
The Torque scheduler was used to run these 7 benchmark applications, each repeated 100 times, and the time to completion was measured. Two types of testing were performed:
- Compute VM scalability test
- Hyper-threading validation
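As a rough illustration of how a run of this shape could be queued, the sketch below generates one Torque qsub command per job: 7 applications times 100 repetitions, or 700 jobs. The application names (app1 to app7), the run_<app>.sh script names and the one-core-per-job resource request are placeholders and assumptions; the source does not list the seven packages or the submission scripts used.

```python
import shlex


def qsub_commands(apps, iterations: int = 100):
    """Build one qsub argument list per (application, repetition) pair."""
    commands = []
    for app in apps:
        for i in range(1, iterations + 1):
            commands.append([
                "qsub",
                "-N", f"{app}-{i:03d}",   # job name
                "-l", "nodes=1:ppn=1",    # assumed: one slot per job
                f"run_{app}.sh",          # hypothetical wrapper script
            ])
    return commands


apps = [f"app{n}" for n in range(1, 8)]  # placeholders for the 7 packages
commands = qsub_commands(apps)
print(len(commands))            # 700
print(shlex.join(commands[0]))  # qsub -N app1-001 -l nodes=1:ppn=1 run_app1.sh
```

Since every job is independent, the scheduler is free to place each one on whichever compute VM has a free slot, and the total wall-clock time is what the tests below measure.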
Compute VM Scalability Test
The test started running on one compute VM and then scaled incrementally to six compute VMs. The results are shown in the graph below with both the wall-clock execution time and speed-up.
Figure 7: The scalability test of running high throughput workloads on incremental compute VMs
The results show that completion time decreases in proportion to the number of compute VMs, so the speed-up grows linearly. The VMware Cloud on AWS infrastructure therefore scales linearly for HPC throughput workloads.
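Speed-up and scaling efficiency are conventionally derived from wall-clock times as in the sketch below. The function name is hypothetical, and the times in the example are made-up, perfectly linear placeholders used only to show the calculation; they are not the measurements from figure 7.

```python
def speedup_table(wall_times):
    """Map VM count -> (speed-up, efficiency) relative to the 1-VM run.

    wall_times: dict of {number_of_compute_vms: wall_clock_seconds},
    which must contain an entry for 1 VM as the baseline.
    """
    baseline = wall_times[1]
    table = {}
    for n_vms, seconds in sorted(wall_times.items()):
        speedup = baseline / seconds
        table[n_vms] = (speedup, speedup / n_vms)  # efficiency 1.0 = ideal
    return table


# Placeholder times for an ideally linear run (NOT measured data).
example = {1: 600.0, 2: 300.0, 3: 200.0, 6: 100.0}
for n_vms, (speedup, efficiency) in speedup_table(example).items():
    print(n_vms, speedup, efficiency)
```

An efficiency that stays close to 1.0 as VMs are added is what "linear scaling" means in practice; a value that drops as the cluster grows would indicate contention somewhere, for example in the shared filesystem.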
vSphere Hyper-threading Validation
With hyper-threading turned on, the test was repeated with the number of job slots per compute VM changed from 18 to 36, which equals the number of logical cores per socket. The results for this test are shown below:
Figure 8: Performance effect of using hyper-threading, where np means the number of job slots per VM
As shown in figure 8, using 36 job slots per VM gives a 6% performance improvement for the same workload, seen as reduced completion time.
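In Torque, the number of job slots per node is normally set with the np attribute in the server's nodes file (server_priv/nodes under the Torque home directory). The sketch below generates such a file for the six compute VMs; the hostnames come from the cluster description above, while the function name is hypothetical and the source does not show the actual file used in the PoC.

```python
def torque_nodes_file(hostnames, np: int) -> str:
    """Render Torque nodes-file content: one 'hostname np=N' line per node."""
    return "".join(f"{host} np={np}\n" for host in hostnames)


compute_vms = [f"hpc-{i:02d}" for i in range(1, 7)]  # hpc-01 .. hpc-06

# 18 slots per VM (one per physical core) versus 36 (one per logical core).
for slots in (18, 36):
    print(f"# np={slots}: {slots * len(compute_vms)} job slots in total")
    print(torque_nodes_file(compute_vms, slots), end="")
```

Doubling np lets the scheduler place two jobs on each physical core, so any gain comes from the jobs sharing core resources more fully rather than from additional cores.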
HPC throughput workloads are compute intensive and require scalable and highly performant infrastructure. We evaluated a set of HPC throughput applications from the BioPerf suite on VMware Cloud on AWS. The AWS-native EFS service was leveraged as the shared filesystem required by the environment. The results show that the VMware Cloud on AWS infrastructure performs and scales well for HPC throughput applications. This solution has also shown that native AWS services like EFS can be used effectively for these applications.
Authors: Michael Cui, Mohan Potheri, Na Zhang