Apache Spark is a unified analytics engine for large-scale data processing. The recent release of Apache Spark 3.0 includes enhanced support for accelerators like GPUs and for Kubernetes as the scheduler. VMware Cloud Foundation 4.x supports Kubernetes via Tanzu and provides enhanced accelerator capabilities. VMware Cloud Foundation can be a great platform for Apache Spark 3, as it supports the new capabilities of GPU acceleration and Kubernetes. This solution seeks to validate the VCF platform with Tanzu for Apache Spark 3. NVIDIA RAPIDS and XGBOOST are important components that are optimized for Apache Spark 3, and our solution will validate these use cases on the VMware Cloud platform.
Apache Spark is the de facto unified engine for big data processing, data science, machine learning and data analytics workloads. This year is Spark’s 10-year anniversary as an open source project. Since its initial release in 2010, Spark has grown to be one of the most active open source projects.
The recently released Apache Spark 3 adds compelling features such as adaptive query execution; dynamic partition pruning; ANSI SQL compliance; significant improvements in pandas APIs; a new UI for structured streaming; up to 40x speedups for calling R user-defined functions; an accelerator-aware scheduler; and SQL reference documentation.
2.1 Spark 3 adds GPU Awareness
GPUs and other accelerators are widely used for accelerating specialized workloads like deep learning and HPC applications. Apache Spark is used to process large datasets and complex data scenarios like streaming, but until recently it did not support the GPUs that data scientists need for machine learning. Spark had no awareness of the GPUs exposed to it and could not request or schedule them for users, which left a critical gap in the unification of big data and AI workloads.
Spark 3 bridges the gap between big data and AI workloads by
- Updating cluster managers to include GPU support and exposing user interfaces to allow for GPU requests
- Updating the scheduler to understand the availability of GPUs allocated to executors, to account for user task requests, and to assign GPUs to tasks properly.
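To make the two changes above concrete, here is a minimal sketch of how a GPU request might be expressed and how it bounds task concurrency. The property names are standard Spark 3 resource settings; the helper functions and the discovery script path are assumptions for illustration, not part of the validated solution.

```python
import math


def gpu_scheduling_conf(executor_gpus, task_gpu_amount, discovery_script):
    """Return the Spark 3 properties that request GPUs for executors and tasks."""
    return {
        "spark.executor.resource.gpu.amount": str(executor_gpus),
        "spark.task.resource.gpu.amount": str(task_gpu_amount),
        # Script the executor runs at startup to report which GPUs it can see.
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }


def concurrent_tasks_per_executor(executor_cores, task_cpus, executor_gpus, task_gpu_amount):
    """Estimate task slots per executor: the tighter of the CPU and GPU limits."""
    cpu_slots = executor_cores // task_cpus
    if task_gpu_amount < 1:
        # A fractional amount lets several tasks share one GPU (0.25 -> 4 tasks per GPU).
        gpu_slots = executor_gpus * math.floor(1 / task_gpu_amount)
    else:
        gpu_slots = executor_gpus // int(task_gpu_amount)
    return min(cpu_slots, gpu_slots)


# Hypothetical path; point it at wherever the discovery script lives in your image.
conf = gpu_scheduling_conf(1, 0.25, "/opt/spark/scripts/getGpusResources.sh")
for key, value in conf.items():
    print(f"--conf {key}={value}")
print("task slots per executor:", concurrent_tasks_per_executor(8, 1, 1, 0.25))
```

With eight cores and one GPU shared four ways, the GPU is the limiting resource, so the scheduler would run at most four tasks at a time on that executor.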
2.2 Spark 3 provides enhanced support for Deep Learning
Deep Learning on Spark was possible in earlier versions, but Spark MLlib was not focused on Deep Learning or its algorithms and didn't offer much support for image processing. Hybrid projects like TensorFlowOnSpark and MMLSpark made it possible, but using them presented significant challenges. Spark 3.0 handles these challenges much better with its added support for accelerators from NVIDIA, AMD, Intel, and others. In Spark 3.0, vectorized UDFs can leverage GPUs for acceleration.
2.3 Better Kubernetes Integration
Spark support for Kubernetes was rudimentary in the 2.x releases and difficult to use in production. Its performance was lacking when compared to that of the YARN cluster manager. Spark 3.0 introduces the new shuffle service for Spark on Kubernetes that allows dynamic scale up and down of Spark executors in Kubernetes.
Figure 1: GPU Accelerated Apache Spark 3
Spark 3.0 also supports GPUs on Kubernetes, providing pod-level isolation of executors and making scheduling flexible on a GPU-enabled cluster.
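As a rough illustration of what a GPU-enabled submission to Kubernetes could look like, the sketch below assembles a `spark-submit` command line. It assumes that executor scale up and down is enabled through dynamic allocation with shuffle tracking, which is my understanding of the Spark 3.0 mechanism. The property names are real Spark settings; the function, API server address, image name, jar path, and script path are hypothetical.

```python
import shlex


def k8s_submit_command(api_server, image, app_jar, main_class, max_executors, discovery_script):
    """Build a spark-submit argv for a GPU-enabled Spark 3 job on Kubernetes."""
    conf = {
        "spark.kubernetes.container.image": image,
        # Let Spark add and remove executor pods as the workload changes.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # One GPU per executor pod, requested from Kubernetes as nvidia.com/gpu.
        "spark.executor.resource.gpu.amount": "1",
        "spark.executor.resource.gpu.vendor": "nvidia.com",
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
        "spark.task.resource.gpu.amount": "1",
    }
    argv = ["spark-submit", "--master", f"k8s://{api_server}",
            "--deploy-mode", "cluster", "--class", main_class]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_jar)
    return argv


# All values below are placeholders, not the ones used in the validated solution.
argv = k8s_submit_command(
    api_server="https://10.0.0.1:6443",
    image="harbor.example.com/spark/spark-gpu:3.0.1",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    max_executors=4,
    discovery_script="/opt/spark/scripts/getGpusResources.sh",
)
print(" ".join(shlex.quote(arg) for arg in argv))
```

Because each executor runs in its own pod with its own GPU request, Kubernetes places executors only on nodes that can satisfy the request.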
2.4 SPARK 3.0 with Kubernetes operator:
The Kubernetes Operator for Apache Spark makes running Spark applications as easy and seamless as running other workloads on Kubernetes. The Kubernetes Operator for Apache Spark ships with a command-line tool called sparkctl that offers additional features beyond what kubectl is able to do. It uses Kubernetes custom resources for specifying, running, and surfacing the status of Spark applications.
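The custom resource the operator works with is a `SparkApplication` object. The following sketch generates one as JSON, which `kubectl apply -f` accepts alongside YAML. The field names follow the operator's `v1beta2` API as I understand it; the application name, namespace, image, service account, and resource sizes are illustrative assumptions.

```python
import json


def spark_application(name, namespace, image, main_file, main_class, executors, gpus_per_executor):
    """Build a SparkApplication custom resource as a plain dictionary."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.0.1",
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {
                "instances": executors,
                "cores": 4,
                "memory": "8g",
                # Ask Kubernetes for GPUs on every executor pod.
                "gpu": {"name": "nvidia.com/gpu", "quantity": gpus_per_executor},
            },
        },
    }


# Placeholder values; substitute your own registry, namespace, and application.
manifest = spark_application(
    name="spark-pi-gpu",
    namespace="spark-apps",
    image="harbor.example.com/spark/spark-gpu:3.0.1",
    main_file="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    executors=2,
    gpus_per_executor=1,
)
print(json.dumps(manifest, indent=2))
```

Saving the output to a file and applying it creates the resource; the operator then submits the application and reports its state on the same object.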
2.5 Accelerated Analytics and AI on Spark
Spark 3.0 marks a key milestone for analytics and AI, as ETL operations are now accelerated while ML and DL applications leverage the same GPU infrastructure. The complete stack for this accelerated data science pipeline is shown below. The use cases for this solution will leverage RAPIDS and XGBoost in this stack to validate the capabilities of VMware Cloud Foundation.
Figure 2: GPU Accelerated Apache Spark 3 (Source: GPU Accelerated Apache Spark)
2.6 NVIDIA RAPIDS:
The RAPIDS suite of software libraries, built on CUDA-X AI, gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces. RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar DataFrame API that integrates with a variety of machine learning algorithms for end-to-end pipeline accelerations without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger dataset sizes.
2.7 New RAPIDS Accelerator for Spark 3.0
NVIDIA CUDA® is a revolutionary parallel computing architecture that supports accelerating computational operations on the NVIDIA GPU architecture. NVIDIA has created a RAPIDS Accelerator for Spark 3.0 that intercepts and accelerates ETL pipelines by dramatically improving the performance of Spark SQL and DataFrame operations.
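Because the accelerator works as a Spark plugin, turning it on is mostly a matter of configuration. Below is a minimal sketch that renders a few of its settings in `spark-defaults.conf` form. The property names come from the RAPIDS Accelerator documentation as I recall it; the chosen values and the helper function are assumptions, and the plugin jars still need to be on the Spark classpath.

```python
def rapids_defaults(concurrent_gpu_tasks=2, pinned_pool="2G"):
    """Render RAPIDS Accelerator settings as spark-defaults.conf lines."""
    settings = {
        # Load the plugin that replaces supported SQL operators with GPU versions.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # How many tasks may use the GPU at the same time.
        "spark.rapids.sql.concurrentGpuTasks": str(concurrent_gpu_tasks),
        # Pinned host memory speeds up transfers between CPU and GPU.
        "spark.rapids.memory.pinnedPool.size": pinned_pool,
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


defaults_text = rapids_defaults()
print(defaults_text)
```

Existing Spark SQL and DataFrame code does not need to change; operations the plugin does not support continue to run on the CPU.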
RAPIDS offers a powerful GPU DataFrame based on Apache Arrow data structures. Arrow specifies a standardized, language-independent, columnar memory format, optimized for data locality, to accelerate analytical processing performance on modern CPUs or GPUs. With the GPU DataFrame, batches of column values from multiple records take advantage of modern GPU designs and accelerate reading, queries, and writing. (Additional details at Acceleration Apache Spark with GPUs)
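The benefit of a columnar layout can be seen even without a GPU. This small sketch, using made-up mortgage-style records and only the Python standard library, pivots row records into one contiguous buffer per column. It is a conceptual illustration, not the Arrow format itself.

```python
from array import array

# Row-oriented records: each row mixes fields of different types.
rows = [
    {"loan_id": 1, "rate": 3.5, "state": "CA"},
    {"loan_id": 2, "rate": 4.0, "state": "TX"},
    {"loan_id": 3, "rate": 3.0, "state": "NY"},
]


def to_columns(records):
    """Pivot a list of row dictionaries into one list per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# A typed array stores the values back to back in memory, which is the
# property that lets CPUs and GPUs process a whole batch of one column at once.
rates = array("d", columns["rate"])
average_rate = sum(rates) / len(rates)
print(columns["state"], average_rate)
```

An aggregate over `rate` touches only the `rates` buffer and never reads `loan_id` or `state`, which is the data locality the columnar format is designed for.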
2.8 XGBOOST:
XGBoost is a well-known gradient boosted decision trees (GBDT) machine learning package used to tackle regression, classification, and ranking problems. It’s written in C++ and NVIDIA CUDA® with wrappers for Python, R, Java, Julia, and several other popular languages. XGBoost now includes seamless, drop-in GPU acceleration, which significantly speeds up model training and improves accuracy for better predictions.
Figure 3: XGBOOST for GPU powered Apache SPARK 3 (Source: GPU Accelerated Apache Spark)
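For readers new to gradient boosted decision trees, the core idea fits in a few lines: each new tree is fit to the errors left by the trees before it. The sketch below uses one-level trees (stumps) on a single feature with invented data. It is plain gradient boosting for squared error, not XGBoost's actual algorithm, which adds second-order gradients, regularization, and GPU histogram building.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted_stumps(xs, ys, rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining error."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (l if x <= t else r) for t, l, r in stumps)


def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Invented toy data: a step from about 1 to about 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = fit_boosted_stumps(xs, ys)
baseline_mse = mean_squared_error([sum(ys) / len(ys)] * len(ys), ys)
boosted_mse = mean_squared_error([predict(model, x) for x in xs], ys)
print(baseline_mse, boosted_mse)
```

The search over thresholds in `fit_stump` is the part that dominates training time on real datasets, and it is this kind of work that benefits from GPU parallelism.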
3.1 Spark History server:
The Spark History Server is a user interface for monitoring the metrics and performance of completed Spark applications. It keeps the history (event logs) of all completed applications along with their runtime information, which allows you to review metrics and examine an application after it has finished. Historical metrics are especially helpful when you are tuning the performance of an application, because you can compare the metrics of previous runs with those of the latest run.
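The event logs that the History Server reads are written when `spark.eventLog.enabled` is set and `spark.eventLog.dir` points at a shared location; the server itself reads from `spark.history.fs.logDirectory`. Each log is a file of JSON lines, so a simple run-to-run comparison can also be scripted. The sketch below uses two invented, heavily shortened logs; the event and field names reflect my understanding of the log format and should be checked against a real log.

```python
import json


def application_summary(event_log_lines):
    """Extract the application name and wall-clock duration from event log lines."""
    name = start_ms = end_ms = None
    for line in event_log_lines:
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerApplicationStart":
            name = event.get("App Name")
            start_ms = event.get("Timestamp")
        elif kind == "SparkListenerApplicationEnd":
            end_ms = event.get("Timestamp")
    duration_s = None
    if start_ms is not None and end_ms is not None:
        duration_s = (end_ms - start_ms) / 1000
    return {"app_name": name, "duration_s": duration_s}


# Invented sample logs for two runs of the same hypothetical application.
previous_run = [
    '{"Event": "SparkListenerApplicationStart", "App Name": "tpcds-q5", "Timestamp": 1600000000000}',
    '{"Event": "SparkListenerApplicationEnd", "Timestamp": 1600000120000}',
]
latest_run = [
    '{"Event": "SparkListenerApplicationStart", "App Name": "tpcds-q5", "Timestamp": 1600003600000}',
    '{"Event": "SparkListenerApplicationEnd", "Timestamp": 1600003690000}',
]

previous = application_summary(previous_run)
latest = application_summary(latest_run)
print(previous, latest)
print("change in seconds:", latest["duration_s"] - previous["duration_s"])
```

A real log contains many more event types (jobs, stages, tasks, executor metrics), which is what the History Server UI renders in detail.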
3.2 Harbor Container Registry:
Harbor is an open source trusted cloud native registry project that stores, signs, and scans content. Harbor secures artifacts with policies and role-based access control, ensures images are scanned and free from vulnerabilities, and signs images as trusted. Harbor extends the open source Docker Distribution by adding functionality that users usually require, such as security, identity, and management. Having a registry closer to the build and run environment can improve image transfer efficiency. Harbor supports replication of images between registries and also offers advanced security features such as user management, access control, and activity auditing. Harbor is a CNCF Graduated project that securely manages artifacts across cloud native compute platforms like Kubernetes and Docker.
3.3 VMware Cloud Foundation with Tanzu
VMware Cloud Foundation with Tanzu delivers hyper-speed Kubernetes that provides agility, flexibility, and security for modern apps. VMware Tanzu delivers the infrastructure and services needed to deploy new applications rapidly as business needs change. VCF provides consistent infrastructure and operations with cloud agility, scale, and simplicity.
Figure 4: VMware Cloud Foundation with Tanzu
VMware Cloud Foundation with Tanzu is a hybrid cloud platform that accelerates the development of modern applications and automates the deployment and lifecycle management of complex Kubernetes environments.
- IT admins have complete visibility and control of virtualized compute, network and storage infrastructure resources through VCF.
- Software defined compute, storage and networking with vSphere, NSX-T and vSAN/VVOL provides ease of deployment and automation.
- Developers have frictionless access to Kubernetes environments and infrastructure resources through VCF Services.
- VMware Cloud Foundation provides runtime services, automation services, and infrastructure services, all delivered via Kubernetes and RESTful APIs.
3.4 Solution Components:
There are two distinct use cases that were deployed and validated in the solution. These include:
- TPC-DS with NVIDIA RAPIDS
  - Execute TPC-DS queries using Spark operator on Tanzu Kubernetes Cluster (TKG) and incorporating NVIDIA GPUs
  - Validate Spark Performance with Tanzu and NVIDIA GPU leveraging RAPIDS
- Machine Learning with XGBOOST
  - Machine learning mortgage data using Spark operator with XGBOOST on Tanzu Kubernetes Cluster (TKG) incorporating NVIDIA GPUs
3.5 Building Blocks of the solution:
Figure 5: HW and Software components of the solutions
In part 2 of the blog series we will look at the implementation of the solution and its validation.