August 10, 2016

Scale Your GPU Cloud Infrastructure With Kubernetes


When Kubernetes 1.3 launched last month, we were excited to see our contribution to preliminary support for GPU scheduling become available to everyone. We know a lot about GPUs and what makes them tick, so we wanted to use our expertise to help build the best GPU support for k8s. Find out why we’re so jazzed about Kubernetes, how we use it, our role in bringing it to the world, and where we’re headed next.


An increasing number of workloads, such as Clarifai’s machine learning, benefit from offloading computations to highly parallel graphics hardware. While not finely tuned in the same way as traditional high-performance systems, such as those built on MPI, a Kubernetes cluster can still be a great environment for those who also need to run a variety of additional, “classic” workloads, such as databases and web serving.

Clarifai has been experimenting for months with Kubernetes on Linux as our production platform, migrating services from an infrastructure relying on virtual machines to one that is container-based. One of the last stumbling blocks was GPU support. We’d like to share a little about our experience, our contributions to Kubernetes, and what we’d like to see in the future.

The Problem

Satisfactory GPU support will take time to address completely, because:

  • different vendors expose the hardware to users in different ways
  • some vendors require some coupling between the kernel driver controlling the GPU and the libraries/applications that access the hardware
  • it adds more resource types (whole GPUs, GPU cores, GPU memory)
  • it can introduce new security pitfalls
  • for systems with multiple GPUs, affinity matters, similarly to NUMA considerations for CPUs
  • running GPU code in containers is still a relatively novel idea

 That said, the advantages are worth the effort required to put the devices to work.

We no longer wanted to be tied to whole virtual machines and their images as the basic scheduling unit. They’re slow to assemble and deploy. They’re not very flexible. To add insult to injury, they can leave resources stranded: in our case, GPU machines still had unused CPU cycles and memory. A lot of memory. We had already started deploying services under Kubernetes, with great results. We now wanted to schedule GPU workloads on machines with the right hardware, along with other, less finicky jobs.

A First Approach

Following a number of discussions that started at KubeCon 2015, we simplified the initial problem to NVIDIA GPUs, because they are very well established in the machine learning world, already power our existing infrastructure and are the only option offered by cloud providers such as AWS and Azure. In the interest of disclosure, NVIDIA also happens to be an investor in Clarifai.

We needed support in two places at a minimum: in the scheduler and in the runtime daemon that manages the node, the kubelet. The former has to make sure that pods get scheduled on machines that have the actual hardware (we use a variety of machine types: GPU ones are effective, but not cheap). The latter needs to make the right Linux device files appear in the container under /dev.
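As a rough illustration of the kubelet half of that requirement, a container can check at startup that the device files it expects are actually visible. This is only a sketch: the device names are the ones listed in the setup steps later in this post, and the function name is hypothetical.

```python
import os

# Device file names as listed in the setup steps of this post.
REQUIRED_DEVICES = ("nvidia0", "nvidiactl", "nvidia-uvm")


def missing_gpu_devices(dev_dir="/dev", required=REQUIRED_DEVICES):
    """Return the required device names that are absent from dev_dir."""
    return [name for name in required
            if not os.path.exists(os.path.join(dev_dir, name))]


if __name__ == "__main__":
    missing = missing_gpu_devices()
    if missing:
        print("missing device files:", ", ".join(missing))
    else:
        print("all GPU device files are present")
```

Running a check like this early tends to give a clearer error than whatever a GPU library reports when it cannot open the device.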


Before all of that, we actually implemented and contributed another feature: adding node labels with the cloud provider’s machine type information, which helps us in tracking resource allocation and waste, for each hardware configuration. Label-based node selection does allow scheduling pods on e.g. GPU instances, but unfortunately, that’s not enough to access the cards.
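To make label-based selection concrete, here is a minimal sketch that builds a pod manifest pinned to one machine type. The label key below is an assumption about the key used for the machine type label (check the labels on your own nodes, for example with kubectl get nodes --show-labels), and the pod name and image are placeholders.

```python
import json

# Assumed label key for the cloud provider's machine type; verify on your cluster.
INSTANCE_TYPE_LABEL = "beta.kubernetes.io/instance-type"


def pod_for_instance_type(name, image, instance_type,
                          label_key=INSTANCE_TYPE_LABEL):
    """Build a pod manifest that only schedules on nodes of one machine type."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # nodeSelector restricts scheduling to nodes carrying this label.
            "nodeSelector": {label_key: instance_type},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    pod = pod_for_instance_type("trainer", "example/trainer:latest", "g2.2xlarge")
    print(json.dumps(pod, indent=2))
```

The resulting JSON can be fed to kubectl create -f. As noted above, this only gets the pod onto a GPU machine; it does not give the container access to the card.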

Working with the Kubernetes team, we came up with a stripped-down GPU support proposal. Deciding what not to include was just as important as deciding what to include: once a feature or behavior is introduced, even if labeled as alpha or beta, it is very difficult to remove. We settled on a custom, experimental resource. It is very basic: it only counts whole devices. Deliberately, it does not:

  • take into consideration GPU vintage, core count, memory, etc.
  • support more than one device per machine in 1.3
  • allow multiple pods on the same machine to share the same card, even if you know what you’re doing (at least on paper: ask us about this one weird trick to do just that!)
  • build or set up drivers
  • automatically expose low-level libraries such as libcuda, libnvidia-ml and libvdpau_nvidia to the containers.

Even with all those limitations, yes, Virginia, you can now schedule containers on e.g. EC2 g2.2xlarge instances — not g2.8xlarge ones, unless you are OK with three GPUs getting ignored by Kubernetes the entire time.
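The counting rules above can be modeled in a few lines. This is a toy model of the behavior described in this post, not the actual scheduler code, and all names are hypothetical.

```python
# The 1.3 limit described above: at most one device per machine is counted.
MAX_VISIBLE_GPUS = 1


def schedulable_gpus(physical_gpus):
    """Number of GPUs Kubernetes would count on a node under the 1.3 limit."""
    return min(physical_gpus, MAX_VISIBLE_GPUS)


def can_schedule(physical_gpus, gpus_in_use, requested):
    """Whole-device accounting: no fractions and no sharing between pods."""
    free = schedulable_gpus(physical_gpus) - gpus_in_use
    return 0 <= requested <= free


if __name__ == "__main__":
    # One card, nothing running: a pod asking for one GPU fits.
    print(can_schedule(physical_gpus=1, gpus_in_use=0, requested=1))
    # Four cards, but only one is counted, so a second GPU pod does not fit.
    print(can_schedule(physical_gpus=4, gpus_in_use=1, requested=1))
```

On a machine with four cards, like the g2.8xlarge case above, the model shows why three of them stay idle: only one is ever counted as capacity.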

How to use it

Despite the fine print above, you still want to run GPU containers. What do you need?

  1. Make sure you’re using Kubernetes 1.3.3 or later. Versions 1.3.0 through 1.3.2 have a known bug.
  2. CUDA and similar toolkits require you to access NVIDIA GPUs through proprietary drivers that are not in the official Linux kernel tree. The exact steps depend on the distribution you use. If, like us, you chose CoreOS, read below on how we simplified and automated the build process. In general, the correct kernel drivers have to be built and installed on your nodes, along with their associated device files in /dev: nvidia0, nvidiactl and nvidia-uvm. You can start from the official driver site, select the latest version and follow the README file for instructions.
  3. The low-level libraries from the NVIDIA installer have to reside somewhere on the node. Ideally, you’d place them in their own directory, which you then add to the dynamic linker’s configuration (don’t forget to run ldconfig to regenerate symlinks for major versions).
  4. Optional: repeat the previous step for any binaries you might need, such as nvidia-cuda-mps-server and nvidia-smi. The latter is especially useful for debugging and monitoring purposes.
  5. Add definitions for volumes and volume mounts to the pod spec in your Deployment configuration file, mapping the directories for libraries and binaries into the pod. Remember to define the volumes using hostPath.
  6. Add the experimental GPU resource, with a value of 1, to both limits and requests in the resources section of your pod spec.
  7. Run your kubelets with the new flag --experimental-nvidia-gpus=1.
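Steps 5 and 6 can be sketched as a pod spec built in Python and printed as JSON, which kubectl accepts alongside YAML. Several things here are assumptions rather than anything this post specifies: the resource name alpha.kubernetes.io/nvidia-gpu (our understanding of the name used by Kubernetes 1.3), the host directories, the mount paths, and the container name and image.

```python
import json

# Assumed name of the experimental GPU resource in Kubernetes 1.3.
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_pod_spec(name, image, host_lib_dir, host_bin_dir):
    """Build a pod spec that mounts host driver files and requests one GPU."""
    return {
        "volumes": [
            # hostPath volumes expose directories from the node itself.
            {"name": "nvidia-libs", "hostPath": {"path": host_lib_dir}},
            {"name": "nvidia-bins", "hostPath": {"path": host_bin_dir}},
        ],
        "containers": [{
            "name": name,
            "image": image,
            "resources": {
                # The same whole-device count goes in both sections.
                "limits": {GPU_RESOURCE: 1},
                "requests": {GPU_RESOURCE: 1},
            },
            "volumeMounts": [
                # Hypothetical mount points inside the container.
                {"name": "nvidia-libs", "mountPath": "/usr/local/nvidia/lib",
                 "readOnly": True},
                {"name": "nvidia-bins", "mountPath": "/usr/local/nvidia/bin",
                 "readOnly": True},
            ],
        }],
    }


if __name__ == "__main__":
    spec = gpu_pod_spec("classifier", "example/classifier:latest",
                        "/opt/nvidia/lib", "/opt/nvidia/bin")
    print(json.dumps(spec, indent=2))
```

In a Deployment, this dictionary would sit under spec.template.spec. The image also has to be able to find the mounted libraries, for example through its own dynamic linker configuration or library search path.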

If your nodes run CoreOS, this is your lucky day: we have tools to help with the second and third steps. They automate the building and packaging, on any Linux distribution, of a specific version of the drivers for a specific version of CoreOS. Check out our other blog post for more information about our GitHub repo.

If, regardless of your distribution, you really don’t want to get into the business of installing libraries on the host and then making them available to the containers, you can bake them into your Docker image, at a cost of about 150MB and having to rebuild all your images every time your nodes switch to a new kernel. We do not recommend that.

Pain points and future work

This section was written from our perspective as Kubernetes users, not contributors. It is not a commitment by us or the Kubernetes team. Consider it aspirational. Void where prohibited by law. If you agree — or do not agree — get involved in the project!

As might be obvious from the steps above, there is still extra work involved. First and foremost, low-level libraries have to be projected from the host into the container. You need to be careful to expose at least the same version that your Docker image was built against and, definitely, the same version as your kernel module. If you store them in separate directories, you can keep many different versions of the libraries on the node; just make sure you mount the right one. For example, if you built an image using CUDA 7.5, you need to use at least version 352.39 of the kernel drivers and low-level libraries.
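One way to picture the "mount the right one" rule is a small lookup that refuses anything too old and returns only the directory matching the kernel module. This is a sketch under assumptions: the directory layout and function names are hypothetical, and the only version pairing taken from this post is CUDA 7.5 with driver 352.39.

```python
# Minimum driver version per CUDA version; the single entry comes from this post.
MIN_DRIVER_FOR_CUDA = {"7.5": "352.39"}


def parse_version(text):
    """Turn a dotted version such as '352.39' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def library_dir_for(kernel_module_version, cuda_version, installed):
    """Pick the library directory that matches the loaded kernel module.

    installed maps a driver version string to the directory on the node
    holding that version's low-level libraries.
    """
    minimum = MIN_DRIVER_FOR_CUDA[cuda_version]
    if parse_version(kernel_module_version) < parse_version(minimum):
        raise ValueError(
            f"CUDA {cuda_version} needs driver {minimum} or later, "
            f"node runs {kernel_module_version}")
    if kernel_module_version not in installed:
        raise LookupError(
            f"no libraries installed for driver {kernel_module_version}")
    return installed[kernel_module_version]
```

For example, with a hypothetical installed = {"352.39": "/opt/nvidia/352.39/lib"}, calling library_dir_for("352.39", "7.5", installed) returns that directory, while an older kernel module version raises ValueError before anything gets mounted.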

To simplify deployment further, we’re working on Docker volume plugin support in Kubernetes. That paves the way for your nodes to use nvidia-docker, which takes care of most of the versioning: it detects all versions of the low-level libraries installed on the node, then presents the container with just the right set. Even more importantly, with the help of standard labels applied to the image, it refuses even to start a container, if it requires e.g. a CUDA version more recent than what is supported by the node’s drivers and low-level libraries. That spares you from troubleshooting mysterious crashes at startup time — the most hated kind of crash here at Clarifai!
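The refusal logic can be approximated as follows. This is not nvidia-docker's code: the label name com.nvidia.cuda.version is our assumption about the label it reads, and how a node determines the highest CUDA version it supports is left out entirely.

```python
# Assumed image label carrying the CUDA version the image was built against.
CUDA_LABEL = "com.nvidia.cuda.version"


def parse_version(text):
    """Turn a dotted version such as '7.5' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def may_start(image_labels, node_max_cuda):
    """Return False if the image needs a newer CUDA than the node supports."""
    required = image_labels.get(CUDA_LABEL)
    if required is None:
        # The image declares no CUDA requirement, so nothing to enforce.
        return True
    return parse_version(required) <= parse_version(node_max_cuda)
```

Failing this check before the container starts turns a mysterious crash into an explicit, explainable rejection.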

Even nvidia-docker takes some effort to configure and keep running. Our hope is that, at some point, its code (or an equivalent) will be merged or run as an even simpler Kubernetes plugin.

Beyond that, we know that better discovery is needed and is being worked on. GPUs come in all shapes and sizes, so to speak. Our largest machine learning models can be served from all our hardware configurations, but can only be trained on the most powerful ones, since the process is more resource intensive than just performing inference — returning a prediction based on user data. We want the ability to report some configuration information so that the Kubernetes scheduler can pick the right hardware. We also need to support configurations with multiple cards.

Last, but not least, some form of overcommitting is desirable for long-running jobs serving user data. While batch jobs tend to keep hardware busy all the time, backends serving user-originated traffic typically follow cyclic patterns (daytime vs. nighttime, weekdays vs. weekends, etc.) and are often idle. To increase utilization, we need a way to share the same device between different jobs, especially when they don’t require all of a card’s memory. Along with Kubernetes autoscaling and latency-based custom metrics, this can be done in a manner that does not compromise user experience or SLOs. Precautions need to be taken, of course, to prevent e.g. production jobs from coexisting with development ones, but we’re optimistic about the viability of the whole approach.
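To illustrate the kind of sharing we have in mind, here is a toy first-fit placement that lets jobs share a card when their memory fits and keeps production and development jobs apart. Nothing like this exists in Kubernetes as described in this post; the classes, tiers and memory figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    memory_mb: int
    tier: str  # for example "production" or "development"


@dataclass
class Card:
    memory_mb: int
    jobs: list = field(default_factory=list)

    def free_mb(self):
        """Memory not yet claimed by the jobs placed on this card."""
        return self.memory_mb - sum(job.memory_mb for job in self.jobs)

    def accepts(self, job):
        """A job fits if tiers do not mix and enough memory remains."""
        same_tier = all(other.tier == job.tier for other in self.jobs)
        return same_tier and job.memory_mb <= self.free_mb()


def place(job, cards):
    """First-fit placement: return the chosen card, or None if none fits."""
    for card in cards:
        if card.accepts(job):
            card.jobs.append(job)
            return card
    return None
```

A real implementation would also have to account for compute contention and latency, which is where autoscaling and custom metrics come in.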

Summing up

Things are getting better and will continue to do so. 


Thanks to the Kubernetes maintainers, especially Eric Tune, David Oppenheimer, Dawn Chen, Tim Hockin and Vish Kannan, as well as Hui-Zhi Zhao, for feedback and for getting the initial support reviewed for merging upstream. At Clarifai, Nand Dalal and Matt Zeiler have been invaluable in figuring out how GPUs behave in real-world production environments a bit outside the norm. Finally, thanks to our friends at NVIDIA for their help and for nvidia-docker.

Read the next post in our series:
GPUs, CoreOS, and containers are three major ingredients behind Clarifai’s magic. Learn how we made it easier to mix them in our lab, with no safety goggles required. Look, ma, no spills!