February 28, 2020

What we Learned by Rewriting our APIs

At Clarifai we strive to build the best AI APIs on the market. We strongly believe in good conventions and rapid iteration on features so that you can get the latest in AI into your applications. We have gone through a few iterations of the API server framework that handles our customers’ traffic. These iterations began in 2014 with a Python-based Django server that we called v1 of our API.


Clarifai API - the early days

Django had clear advantages in the early days of Clarifai: all our researchers and engineers used Python as their go-to language, and Django was the de facto web framework in Python. It came equipped with a huge amount of built-in functionality so that we could get endpoints created quickly. It also had a really great built-in ORM around the database layer so that we could manage user accounts and data with ease.

Python allowed for rapid prototyping, but we quickly found many limitations to using it as an API server as our customer traffic and our team size increased. As customer traffic grew, Python’s thread-based concurrency model became an issue because of the global interpreter lock (GIL). We also had to combine the Django server with uWSGI and nginx to serve traffic in production, each with its own threading model. This complicated the life of a request drastically, especially before we had great tools like Kubernetes that make deploying dependencies easy.

Most of these concurrency issues could be handled with careful deployments and hand-tuning of threading configuration, but new engineers found it difficult to understand these fine-tuned services. Combine this with the dynamically typed nature of Python, and rapid development of endpoints quickly became difficult. We were building new products around search (the precursor to what is now launched as our search APIs and UIs in Portal), and because we moved quickly in the early days with very few tests, we saw instability increase as we added new endpoints. This led to a large untested codebase. Tracing through that code, especially without a compiler or types in place, quickly became untenable.


Introducing v2 and Go

At the start of 2016, two years after the introduction of our v1 API, we knew we wanted to introduce v2. We had the freedom to redesign the API conventions, introduce a few crucial new features at launch (both custom training of models and visual search), and make important engineering decisions on how to make the API robust for future scale and extensibility. This was our opportunity to take a step back and see if we could avoid the pitfalls of our v1 codebase. Investigation into alternatives ultimately led us to the compiled and typed Go language. As soon as we learned more about it, we fell in love.

We loved that the compiler was so quick to build a new program that you could actually keep it rebuilding and rerunning in a loop as files changed, using a tool called CompileDaemon. This not only resulted in a nice binary that was very lightweight and portable for deployments, but also allowed us to write much better code much faster than in an interpreted language. The typing also made the quickly growing codebase and growing team less of an issue.

Go also comes with a very different concurrency model built on goroutines. Without the limitation of the Python GIL or the need for separate production deployment configurations like uWSGI, it became much simpler to scale our traffic to many simultaneous API requests. Goroutines are so lightweight that we also found ourselves leveraging them for async processing in early versions of our API, since it was so seamless to add them to request handlers.
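To make the goroutine point concrete, here is a minimal sketch of a request handler that acknowledges a request right away and finishes the slow part in a goroutine. It uses only the Go standard library. The names `asyncJobs`, `record`, `submit`, and the `/inputs` path are hypothetical and chosen for illustration; this is not our actual handler code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// asyncJobs tracks background work so callers can wait for it to finish.
// (Hypothetical type, for illustration only.)
type asyncJobs struct {
	wg   sync.WaitGroup
	mu   sync.Mutex
	done []string
}

// record stands in for slow post-processing, such as indexing an input.
func (j *asyncJobs) record(id string) {
	defer j.wg.Done()
	j.mu.Lock()
	j.done = append(j.done, id)
	j.mu.Unlock()
}

// handler replies immediately and hands the remaining work to a goroutine.
func (j *asyncJobs) handler(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")
	j.wg.Add(1)
	go j.record(id) // starting a goroutine is a one-line change in the handler
	w.WriteHeader(http.StatusAccepted)
	fmt.Fprintf(w, "queued %s\n", id)
}

// submit calls the handler in-process and returns the HTTP status code.
func submit(j *asyncJobs, id string) int {
	req := httptest.NewRequest(http.MethodPost, "/inputs?id="+id, nil)
	rec := httptest.NewRecorder()
	j.handler(rec, req)
	return rec.Code
}

func main() {
	jobs := &asyncJobs{}
	for _, id := range []string{"a", "b", "c"} {
		fmt.Println(id, submit(jobs, id))
	}
	jobs.wg.Wait() // wait for all background goroutines before exiting
	fmt.Println("background jobs finished:", len(jobs.done))
}
```

In a real service the background work would also need error handling and a shutdown path, which is part of why fire-and-forget goroutines are best kept to simple cases.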

To handle serving requests from Go, we also needed to change web frameworks from Django to a Go equivalent. Unfortunately, there was nothing nearly as good as Django. We landed on Goji, using v1 and later v2, but eventually replaced both with grpc-gateway around 2017 and have never looked back. grpc-gateway allows us to define our API completely in protobufs so that we can generate the server-side handler logic, the server mux that routes requests, the HTTP-to-gRPC converter provided by grpc-gateway, and API clients, all from one consistent definition of what our API does. It gave us a well-defined API specification with which we could scale our API quickly and easily. In a follow-up blog post, I’ll outline how we leverage grpc-gateway to handle both gRPC traffic and HTTP traffic, even on the same host and port.
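As a rough illustration of how gRPC and HTTP traffic can share one port, the sketch below shows a common routing pattern: inspect each request and send HTTP/2 requests with a gRPC content type to one handler and everything else to another. This is a generic pattern and not necessarily the setup we run. It uses only the standard library, with the hypothetical `label` handlers standing in for a gRPC server and a grpc-gateway mux (both of which can be used as an `http.Handler`); `splitHandler`, `isGRPC`, `route`, and the `/v2/models` path are also illustrative names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isGRPC reports whether a request looks like gRPC:
// HTTP/2 with a content type starting with "application/grpc".
func isGRPC(r *http.Request) bool {
	return r.ProtoMajor == 2 &&
		strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc")
}

// splitHandler sends gRPC requests to grpcHandler and all others to httpHandler.
func splitHandler(grpcHandler, httpHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isGRPC(r) {
			grpcHandler.ServeHTTP(w, r)
			return
		}
		httpHandler.ServeHTTP(w, r)
	})
}

// label is a stand-in handler that writes its own name as the response body.
func label(name string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, name)
	})
}

// route pushes a synthetic request through h and returns the response body.
func route(h http.Handler, protoMajor int, contentType string) string {
	req := httptest.NewRequest(http.MethodPost, "/v2/models", nil)
	req.ProtoMajor = protoMajor
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	h := splitHandler(label("grpc"), label("gateway"))
	fmt.Println(route(h, 2, "application/grpc")) // handled by the gRPC stand-in
	fmt.Println(route(h, 1, "application/json")) // handled by the gateway stand-in
}
```

Serving this on a real listener would additionally require HTTP/2 to be enabled on the server, which the sketch leaves out.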


Platform maturity

Now in 2020, we have over 140 endpoints in our v2 API, compared to fewer than 10 in v1. These endpoints are thoroughly tested with both unit tests and integration tests, and deployable with ease in a Kubernetes cluster as an array of microservices. We also have a new process through which all users can get notified of upcoming API changes on our docs page. Just subscribe to our GitHub repo for documentation and you’ll get notified automatically of any updates. You can even help us improve our documentation.

Don’t get me wrong, we haven’t moved away from Python by any stretch, but now with microservices we play both Python and Go to their strengths. We have all the high-concurrency API and DB operations written in Go, while all AI and media processing is handled in Python. It’s been a win-win ever since.

If you want to see the results of our API in action, sign up for an API key today!