August 10, 2016

Simplify Building Nvidia GPU Drivers On CoreOS

At Clarifai, we’re big fans of GPUs. The whole company started as a single desktop computer in a Ph.D. candidate’s apartment, packed with two Titan GPU cards, which went on to win ImageNet in 2013. We have grown a lot since then, but, in many ways, we still try to squeeze as much as possible out of our hardware.

We’re also big fans of containers. We have been migrating our production environment toward Kubernetes clusters running under CoreOS, an ideal host for containers. Thanks to this combination, we can squeeze a lot more out of our CPUs and memory. Check our blog for another post on how we’re starting to do the same with our GPUs, too.

Combine the two and… what do you get? Until now, not a lot of love. The traits that make CoreOS well-suited for containers, its minimalism in particular, also make it harder to build the proprietary NVIDIA drivers needed by our hardware. There is no package manager! No Python! No Perl! There’s a thing called Toolbox, but it’s actually a container running Fedora. What did we get ourselves into?

There are solutions out there, but they’re a bit clumsy. For example, they might involve a Docker image that builds the driver within an Ubuntu environment. That image uses an arbitrarily picked compiler. Even if it’s the same major/minor version as the one in CoreOS, does it include any patches? Also, you need to keep track of the exact kernel source. If it sounds like a headache, that’s because it is.

Well, do not despair. With pointers from CoreOS folks, we have found a better way. A little-advertised feature, the developer container image, lets us compile the driver within the same environment that built a particular CoreOS version. In other words, it uses the same toolchain and kernel configuration used by the CoreOS build system to put together a given release. Even better, the scripts can be started from a machine running any kind of Linux distribution: it doesn’t have to be CoreOS. The main requirements are a recent version of systemd-nspawn and about 4 GB of free disk space.
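
If you want to check those two requirements before kicking off a build, something like the following might do. It is only a rough sketch in Python, not part of our scripts: the function names and the exact 4 GiB threshold are assumptions made for illustration, and it only checks that systemd-nspawn is on the PATH, not how recent it is.

```python
import shutil

# "About 4 GB" comes from the post; treating it as exactly 4 GiB is an assumption.
REQUIRED_FREE_BYTES = 4 * 1024**3


def has_enough_space(path, required_bytes=REQUIRED_FREE_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def check_prerequisites(workdir="."):
    """Return a list of human-readable problems; an empty list means nothing was found missing."""
    problems = []
    # Only checks that the binary exists on PATH; it does not verify the version.
    if shutil.which("systemd-nspawn") is None:
        problems.append("systemd-nspawn not found on PATH")
    if not has_enough_space(workdir):
        problems.append(f"less than about 4 GB free under {workdir!r}")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print("missing:", issue)
    if not issues:
        print("prerequisites look fine")
```

Run it from the directory where you plan to build, since the disk space check applies to the filesystem that holds the working directory.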

Visit our coreos-nvidia repository and check out our scripts. Feel free to help by reporting issues or, better, submitting pull requests. Or, best of all, by joining us: we’re always hiring!

Acknowledgments 

Thanks to CoreOS’ team, Brandon Philips and Michael Marineau in particular, for pointing me in the right direction and for doing a great job in general. Thanks also to our friends at NVIDIA (which happens to be an investor in Clarifai).

 

Read the next post in our series: 

When Kubernetes 1.3 launched last month, we were excited to see our contribution to preliminary support for GPU scheduling become available to everyone. We know a lot about GPUs and what makes them tick, so we wanted to use our expertise to help build the best GPU support for k8s. Find out why we’re so jazzed about Kubernetes, how we use it, our role in bringing it to the world, and where we’re headed next.