In part 1 we introduced the solution and its deployment. In part 2 we will look at the validation of the solution and the results.
Testing Methodology:
The goal of the testing is to validate the vSphere platform for running Caffe2 and PyTorch. Capabilities such as sharing GPUs between containers were evaluated and tested.
- For the baseline, a container with a full GPU was launched in Kubernetes and three different deep learning models were run on the two platforms, Caffe2 and PyTorch.
- The same tests were repeated with four concurrent pods sharing the same GPU. Bitfusion FlexDirect was used to provide partial (0.25) GPU access to the containers.
- The results were then compared against the baseline.
The two cases are summarized in the table below, followed by a command-line sketch of how each case might be invoked.
Table 3: Remote and Full GPU use Cases
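As a minimal sketch, assuming the FlexDirect command-line interface is available inside the client containers, the two cases could be invoked as shown below. The exact flags may vary by FlexDirect version, and nvidia-smi stands in for the actual benchmark command.

# Baseline case: request one full remote GPU
flexdirect run -n 1 -- nvidia-smi

# Shared case: request a 0.25 partial GPU so four containers can share one physical GPU
flexdirect run -n 1 -p 0.25 -- nvidia-smi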
Results:
The baseline tests were run on a single container with full access to the GPU. The models were run in sequence multiple times to get the baseline images processed per second. Appendices A and B provide details about the containers used for Caffe2 and PyTorch.
Table 4: Image Throughput with PyTorch testing
For the sharing use case, the benchmarking jobs were run in parallel, distributed randomly across all four client containers. The baseline tests were run in a single container with full access to the remote GPU, while the four Docker containers in the shared case were each allocated 25% of the remote GPU resources. The tests were repeated multiple times to ensure repeatability. The throughput, measured in images per second, and the completion times were compared between the shared use case and the baseline.
The testing was then run for PyTorch, leveraging custom scripting and the deep learning benchmarking suite. The results from the PyTorch testing are shown in tabular and graphical format.
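As a rough illustration only, an experimenter run for PyTorch might look like the following; the script path, model list, and parameter values are assumptions for illustration rather than the exact configuration used in this testing.

# Hypothetical DLBS experimenter invocation for the PyTorch runs
# (python2.7 is required by experimenter.py, as noted in Appendix A)
cd /workspace/dlbs
python2.7 python/dlbs/experimenter.py run \
    -Pexp.framework='"pytorch"' \
    -Vexp.model='["resnet50", "alexnet", "vgg16"]' \
    -Pexp.gpus='"0"'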
Figure 4: Results from running different deep learning models on PyTorch with and without GPU sharing
The testing was then repeated for Caffe2, leveraging custom scripting and the deep learning benchmarking suite. The results from the Caffe2 testing are shown in tabular and graphical format.
Table 5: Image Throughput with Caffe2 testing
Figure 5: Results from running different deep learning models on Caffe2 with and without GPU sharing
The results from both the PyTorch and Caffe2 testing clearly show benefits to sharing GPUs across multiple containers. This could be because no single model uses the entire GPU on its own. Each model shows gains from sharing, though the magnitude differs based on its profile.
Conclusion:
We have shown that the Caffe2 and PyTorch deep learning frameworks work well with VMware SDDC and PKS. The solution effectively leveraged a deep learning benchmarking suite with Caffe2 and PyTorch to automate and run common machine learning models with scalability and improved utilization of NVIDIA GPUs. The solution showcases the benefits of combining best-in-class infrastructure provided by the VMware SDDC with the production-grade Kubernetes of PKS to run open source ML frameworks like Caffe2 and PyTorch efficiently.
Appendix A: PyTorch Container details
#
# This example Dockerfile illustrates a method to install
# additional packages on top of NVIDIA’s PyTorch container image.
#
# To use this Dockerfile, use the docker build command.
# See https://docs.docker.com/engine/reference/builder/
# for more information.
#
# This is for HP DLB cookbook – pytorch
# But experimenter fails – looks like it needs python 2.7
FROM nvcr.io/nvidia/pytorch:18.06-py3
# Install flexdirect
RUN cd /tmp && wget -O installfd getfd.bitfusion.io && chmod +x installfd && ./installfd -v fd-1.11.2 -- -s -m binaries
# Install python2.7 which is needed for experimenter.py used in hpdlb benchmark
RUN apt-get update && apt-get install -y --no-install-recommends \
    python2.7 && \
    rm -rf /var/lib/apt/lists/*
RUN mkdir -p /workspace/dlbs
WORKDIR /workspace/dlbs
#
# IMPORTANT:
# Build docker image from the dir: /data/tools/deep-learning-benchmark
#
COPY ./dlbs /workspace/dlbs
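Assuming the Dockerfile above is saved in the deep-learning-benchmark directory noted in the comments, the image could be built along these lines; the image tag is illustrative.

# Build the PyTorch benchmark image from the DLBS build context
cd /data/tools/deep-learning-benchmark
docker build -t pytorch-dlbs:18.06 .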
Appendix B: Caffe2 Container details
#
# This example Dockerfile illustrates a method to install
# additional packages on top of NVIDIA’s Caffe2 container image.
#
# To use this Dockerfile, use the docker build command.
# See https://docs.docker.com/engine/reference/builder/
# for more information.
#
# This is for HP DLB cookbook – Caffe2
FROM nvcr.io/nvidia/caffe2:18.05-py2
# Install flexdirect
RUN cd /tmp && wget -O installfd getfd.bitfusion.io && chmod +x installfd && ./installfd -v fd-1.11.2 -- -s -m binaries
RUN mkdir -p /workspace/dlbs
WORKDIR /workspace/dlbs
#
# IMPORTANT:
# Build docker image from the dir: /data/tools/deep-learning-benchmark
#
COPY ./dlbs /workspace/dlbs
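For a quick sanity check outside Kubernetes, the Caffe2 image could be built and run against a quarter of a remote GPU roughly as follows. The image tag is illustrative, and the command assumes the FlexDirect server configuration is made available to the container.

# Build the Caffe2 benchmark image from the same DLBS build context
cd /data/tools/deep-learning-benchmark
docker build -t caffe2-dlbs:18.05 .

# Verify that the container sees a 0.25 partial remote GPU
docker run --rm caffe2-dlbs:18.05 flexdirect run -n 1 -p 0.25 -- nvidia-smi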