Accelerated Apache Spark 3 leveraging NVIDIA GPUs on VMware Cloud (Part 3 of 3)

In part 1 of the series, we introduced the solution and the components used. In part 2, we looked at the first use case, leveraging GPUs with Apache Spark 3 and NVIDIA RAPIDS for transaction processing. In this blog, we look at the use of GPUs with Apache Spark 3 and XGBoost for accelerated ML training.

Use Case 2: Model mortgage data using Spark XGBoost running on Kubernetes with Spark Operator

In this use case, we use an example mortgage application to demonstrate the GPU-accelerated XGBoost-Spark project. Further details of the Spark XGBoost examples are described on the following page:

This repo provides docs and example applications that demonstrate the GPU-accelerated XGBoost-Spark project. The Scala programs in this reference were used for the solution. The mortgage data from the link provided on that page was used for the validation.

Use Case Prerequisites

  • Apache Spark 3.0+ (e.g.: Spark 3.0)
  • Hardware Requirements
    • NVIDIA Pascal™ GPU architecture or better
    • Multi-node clusters with homogeneous GPU configuration
  • Software Requirements
    • Ubuntu 16.04/CentOS7
    • CUDA V10.1/10.2
    • NVIDIA driver compatible with your CUDA
    • NCCL 2.4.7
  • Kubernetes 1.6+ cluster with NVIDIA GPUs
  • kubectl installed and configured in the job submission environment
    • Required for managing jobs and retrieving logs

Create docker image to run Spark XGBoost application related tasks

We created a custom Docker image based on nvidia/cuda:10.2-devel-ubuntu18.04 that includes the XGBoost libraries. This article was used as a reference for building the XGBoost image with Docker and was very useful for obtaining the XGBoost libraries.

Data leveraged in use case

An example mortgage application was used to demonstrate the efficacy of XGBoost leveraging GPUs with Apache Spark and Kubernetes. The mortgage data was downloaded from this location, and data for the years 2000 and 2001 was used in our deployment. The dataset is derived from Fannie Mae’s Single-Family Loan Performance Data.

Fannie Mae provides loan performance data on a portion of its single-family mortgage loans to promote better understanding of the credit performance of Fannie Mae mortgage loans. The population includes two datasets. The Single-Family Fixed Rate Mortgage (primary) dataset contains a subset of Fannie Mae’s 30-year and less, fully amortizing, full documentation, single-family, conventional fixed-rate mortgages. The HARP dataset contains approximately one million 30-year fixed rate mortgage loans that are in the primary dataset that were acquired by Fannie Mae from January 1, 2000 through September 30, 2015 and then subsequently refinanced into a fixed rate mortgage through HARP from April 1, 2009 through September 30, 2016.

Build Scala distribution jars for Mortgage ML

We followed the Maven build instructions for the XGBoost Scala mortgage code to build the Scala distribution jar file for the Mortgage ML programs that use the XGBoost libraries.
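As a rough illustration of how the custom image and the jar come together, the sketch below builds a Spark Operator job definition as JSON, which kubectl accepts as well as YAML. The field names follow our understanding of the operator's v1beta2 SparkApplication resource. The image name, jar path, main class, service account, and resource sizes are all hypothetical placeholders and are not the values used in this solution.

```python
import json


def build_spark_application(name, image, jar_path, main_class,
                            executors=2, gpus_per_executor=1):
    """Return a SparkApplication manifest (as a dict) for the Spark Operator.

    All values passed in are placeholders for illustration; the memory and
    core sizes below are assumptions, not the settings from this solution.
    """
    if executors < 1 or gpus_per_executor < 1:
        raise ValueError("executors and gpus_per_executor must be >= 1")
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": jar_path,
            "sparkVersion": "3.0.0",
            "driver": {
                "cores": 1,
                "memory": "4g",
                "serviceAccount": "spark",  # hypothetical service account
            },
            "executor": {
                "instances": executors,
                "cores": 1,
                "memory": "8g",
                # Ask Kubernetes for NVIDIA GPUs on each executor pod.
                "gpu": {"name": "nvidia.com/gpu",
                        "quantity": gpus_per_executor},
            },
        },
    }


manifest = build_spark_application(
    name="mortgage-xgboost-train",                         # hypothetical
    image="<registry>/<spark-xgboost-image>:<tag>",        # placeholder
    jar_path="local:///opt/spark/jars/<mortgage-jar>.jar",  # placeholder
    main_class="<mortgage-training-main-class>",           # placeholder
)
manifest_json = json.dumps(manifest, indent=2)
```

Writing `manifest_json` to a file and applying it with kubectl would submit the job, assuming the Spark Operator is already installed in the cluster.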

ETL conversion of Mortgage data

Before running ML training on the mortgage data, the data had to be converted using ETL programs. The process described in this guide was used to convert the raw mortgage data to a format compatible with the ML program. Some trial and error was needed to find the right set of arguments for the Scala version of the ETL program.

The figure below shows the relative sizes of the year 2000 mortgage data before and after ETL conversion. The directory “/sparknfs/sparkdata/mortgage-data/m2000/parque_out/data/” contains the output of the ETL program in Parquet format.

Figure 9: ETL Data output in Parquet format
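A size comparison like the one in Figure 9 can also be computed with a small script. The sketch below is not part of the original solution. It walks two directories and reports how much smaller the Parquet output is than the raw input. The function names are hypothetical, and the directories are whatever you pass in.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for filename in files:
            file_path = os.path.join(root, filename)
            # Skip symlinks so linked data is not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def size_ratio(raw_dir, parquet_dir):
    """Return raw size divided by Parquet size (higher means smaller output)."""
    raw_bytes = dir_size_bytes(raw_dir)
    parquet_bytes = dir_size_bytes(parquet_dir)
    if parquet_bytes == 0:
        raise ValueError("Parquet directory is empty: " + parquet_dir)
    return raw_bytes / parquet_bytes
```

For example, `size_ratio(raw_input_dir, etl_output_dir)` returns 4.0 if the raw files are four times larger than the ETL output.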

Mortgage ML training

XGBoost was used to run ML training on one year of mortgage data. As the results below from the Spark History Server show, the GPU run was much faster than the CPU run.

  • The CPU run took 33 minutes to complete training.
  • The GPU run completed training in just under 2 minutes.

Figure 10: XGBoost runtime comparison between CPUs and GPUs on Spark History Server

The results show a speedup of roughly 16x for XGBoost training on GPUs versus CPUs (33 minutes versus just under 2 minutes).
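The speedup figure follows directly from the two runtimes. The short sketch below shows the arithmetic. Because the GPU run took "just under" 2 minutes, using exactly 2.0 minutes is an assumption that gives a conservative lower bound on the speedup.

```python
def speedup(cpu_minutes, gpu_minutes):
    """Return how many times faster the GPU run was than the CPU run."""
    if cpu_minutes <= 0 or gpu_minutes <= 0:
        raise ValueError("runtimes must be positive")
    return cpu_minutes / gpu_minutes


cpu_minutes = 33.0  # CPU training time reported above
gpu_minutes = 2.0   # assumption: upper bound for "just under 2 minutes"

print(f"{speedup(cpu_minutes, gpu_minutes):.1f}x")  # prints "16.5x"
```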

Mortgage ML test

As is customary in testing, a small portion of the data not used in the training was used as the test data. We created “test” data with one quarter’s mortgage data (Q1-2001) using ETL to prepare the data. Then we tested the model created in the training run in the previous step.

The results below show that the trained model is effective, with approximately 98.7% accuracy on the sample that we used:

            ==> Benchmark: Accuracy for [Mortgage GPU Accuracy parquet stub Unknown Unknown Unknown]: 0.9873203857439782
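For readers unfamiliar with the metric, the sketch below shows how classification accuracy is typically computed, assuming the benchmark reports the standard definition: the fraction of test rows whose predicted label matches the actual label. The labels are made-up toy values and are not taken from the mortgage dataset.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for actual, predicted in zip(labels, predictions)
                  if actual == predicted)
    return correct / len(labels)


# Toy example: 7 of 8 predictions match, so accuracy is 0.875.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 0]
toy_predictions = [0, 0, 1, 0, 0, 0, 0, 0]
print(accuracy(toy_labels, toy_predictions))  # prints 0.875
```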


We successfully deployed GPU-accelerated Apache Spark 3 on VMware Tanzu Kubernetes Grid in this solution. The solution demonstrated that:

  • VMware Cloud Foundation is a great platform for Apache Spark 3
  • VMware support for GPUs and Kubernetes can be effectively used with Apache Spark 3
  • TPC-DS with NVIDIA RAPIDS and GPUs was effectively showcased in the solution
  • NVIDIA GPUs were used with Kubernetes for Machine Learning with XGBoost