February 28, 2020

4 Keys to Great Training Data


Clarifai makes it easy to build custom computer vision models that require minimal training data, but you will need SOME training data if you have a specialized business problem. Here is what to look for in great training data for computer vision AI.

Clarifai requires less training data for custom computer vision models

One of the things that differentiates Clarifai from our competition is that our advanced transfer learning architecture makes it possible for our customers to “do AI” with less work and less training data.

Transfer learning means that we can train a neural network model on known tasks, and then use this model as the foundation of a new purpose-specific model. In effect, your model can benefit from years of prior training, done on high-performance machines with high-quality training data.

It's a little bit like teaching a basketball player to play soccer. Even though these games are very different, basketball players already know how to move, play with a ball and work together on a team. A professional basketball player is bound to be better at soccer than someone with no background in sports. They can transfer many of their basketball skills to soccer . . . with a little training.
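As a rough illustration of the idea (a toy sketch, not Clarifai's actual architecture), the code below keeps a "base model" frozen and trains only a small new layer on top of it. The function `base_features` is a hypothetical stand-in for a pretrained network, and the six data points are made up for the example.

```python
import math

def base_features(point):
    """Hypothetical stand-in for a frozen, pretrained base model."""
    x, y = point
    return [x + y, x - y]

def train_head(samples, labels, epochs=200, learning_rate=0.5):
    """Fit a small logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for point, label in zip(samples, labels):
            features = base_features(point)  # the base model is never updated
            score = sum(w * f for w, f in zip(weights, features)) + bias
            error = 1.0 / (1.0 + math.exp(-score)) - label
            weights = [w - learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def predict(point, weights, bias):
    """Classify a point using the frozen base plus the trained head."""
    features = base_features(point)
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

# Only six labeled examples are needed because the base does most of the work.
samples = [(2, 1), (1, 2), (3, 0.5), (-2, -1), (-1, -2), (-0.5, -3)]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_head(samples, labels)
print([predict(p, weights, bias) for p in samples])
```

Only the head's weights change during training, which is why a handful of labeled examples can be enough to get started.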


You need SOME training data

Our customers love transfer learning because it means that their businesses can get started in AI quickly, with less time and data put into training their new models. 


We’ve trained up our base models, and fed them with the best training data and algorithms that our award-winning research scientists can find. It's like Clarifai is constantly recruiting the best athletes for your team and keeping them in shape.

Our base models are the ideal starting point for many business applications, and deliver high performance right out of the box. But there is no getting around the fact that your business will need SOME training data if you are going to build a unique solution for your specific business problem. 


4 Keys to great training data for computer vision AI


1) Accurate annotations

Training data is different from other data because it has been marked up with “labels” or “annotations”. These annotations are there to explain the rules of the new game to your AI model. It is a bit like explaining the rules of soccer to a basketball player and then giving them some time to practice.

Perhaps it should go without saying, but the accuracy of your annotations is essential to your success. 

Data annotation can be a tedious and time-consuming process, and in the beginning you are going to need a human to do the annotating for you. This means that errors are possible, and quality assurance is essential.

You need to make sure that your basketball player is being taught the correct rules for soccer, is practicing on a soccer field, and playing against other soccer players, if you hope for them to succeed.
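One simple form of quality assurance is to have two people label the same images and review every image where they disagree. The sketch below assumes labels are stored as plain dictionaries keyed by image ID; the function name, image IDs and labels are all hypothetical.

```python
def review_annotations(labels_a, labels_b):
    """Compare two annotators' labels for the same images.

    Returns the agreement rate and the image IDs that need a second look.
    """
    shared_ids = sorted(set(labels_a) & set(labels_b))
    if not shared_ids:
        raise ValueError("the two annotators share no images")
    disputed = [i for i in shared_ids if labels_a[i] != labels_b[i]]
    agreement = 1 - len(disputed) / len(shared_ids)
    return agreement, disputed

annotator_1 = {"img_001": "sneaker", "img_002": "boot",
               "img_003": "sandal", "img_004": "boot"}
annotator_2 = {"img_001": "sneaker", "img_002": "boot",
               "img_003": "sneaker", "img_004": "boot"}

agreement, disputed = review_annotations(annotator_1, annotator_2)
print(f"agreement: {agreement:.0%}, needs review: {disputed}")
```

A low agreement rate usually points to unclear labeling guidelines rather than careless annotators, so it is worth checking the instructions first.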


2) Distinct features

Computer vision AI can often recognize visual content faster, and sometimes more accurately, than the human eye, but you should still use good human intuition when selecting your training data. Images should provide distinct visual examples of the features that you want your new model to recognize.

As a rule of thumb, if a person cannot recognize a feature in an image, then it is likely that this input is not going to be a good training sample. Data samples must be chosen so that features are recognizable to the average user. 

Poorly lit, hazy, or confusing images are not recommended as training data. Clearly explain the rules of the new game to your model.
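A quick automated screen can catch the most obvious problem images before a person reviews them. The sketch below assumes each image is already available as a 2D list of grayscale values from 0 to 255; the two thresholds are hypothetical and would need tuning for your own data.

```python
from statistics import mean, pstdev

MIN_BRIGHTNESS = 40  # hypothetical threshold: below this, the image is too dark
MIN_CONTRAST = 20    # hypothetical threshold: below this, the image looks flat or hazy

def image_quality_issues(pixels):
    """Return a list of likely problems for one grayscale image."""
    flat = [value for row in pixels for value in row]
    issues = []
    if mean(flat) < MIN_BRIGHTNESS:
        issues.append("too dark")
    if pstdev(flat) < MIN_CONTRAST:
        issues.append("low contrast")
    return issues

dark_image = [[5, 10, 8], [12, 6, 9], [7, 11, 10]]
hazy_image = [[120, 125, 122], [118, 121, 124], [123, 119, 120]]
clear_image = [[20, 200, 40], [220, 30, 210], [50, 240, 60]]

for name, image in [("dark", dark_image), ("hazy", hazy_image), ("clear", clear_image)]:
    print(name, image_quality_issues(image))
```

A check like this only filters out clear failures. The rule of thumb above still applies: if a person cannot recognize the feature, leave the image out.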


3) Representative samples

Training data should be similar to the data that your application will be interacting with in the real world. 

For example, we work with many retailers who sell their products online. These retailers often have product images that have been taken in the studio on a white backdrop. These studio images can be great for displaying and selling products online, but are not ideal for identifying products in images taken by users. User images are generally not taken in a studio on a white backdrop. They might be taken in a living room, or on the street.

Look for training images that were taken in similar conditions and environments to the data that your model will be working with in production. This context is key because it affects the patterns of light recognized by your new model.
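If you can tag each image with the conditions it was taken in, comparing the training set against a sample of production images shows where the gaps are. The sketch below assumes such tags exist; the condition names, counts and the 15-point tolerance are hypothetical.

```python
from collections import Counter

def condition_shares(conditions):
    """Return the share of images taken under each condition."""
    counts = Counter(conditions)
    total = sum(counts.values())
    return {condition: n / total for condition, n in counts.items()}

def underrepresented(train_conditions, production_conditions, tolerance=0.15):
    """List conditions that are much rarer in training than in production."""
    train = condition_shares(train_conditions)
    production = condition_shares(production_conditions)
    return sorted(
        condition
        for condition, share in production.items()
        if train.get(condition, 0.0) < share - tolerance
    )

# Training set dominated by studio shots; production is mostly user photos.
training = ["studio"] * 90 + ["living_room"] * 10
production = ["studio"] * 10 + ["living_room"] * 60 + ["street"] * 30

gaps = underrepresented(training, production)
print("collect more training images for:", gaps)
```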


4) Sufficient number of samples

Practice makes perfect.

You will need a sufficient number of samples to get your idea across. Depending on the use case, some models require significantly more training data than others. If your business has a particularly unique problem, or if you need high levels of accuracy for your solution, you may find that you need a larger training data set. 

If you want your professional basketball player to become a professional soccer player, it is going to take a lot of time and practice.
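Before training, it helps to count how many samples you have for each label and see which ones fall short. The sketch below uses a hypothetical minimum of 50 samples per label; the right number depends on your use case and accuracy needs, as described above.

```python
from collections import Counter

def labels_needing_more_data(labels, min_per_label=50):
    """Return how many more samples each under-filled label needs."""
    counts = Counter(labels)
    return {
        label: min_per_label - n
        for label, n in sorted(counts.items())
        if n < min_per_label
    }

# Hypothetical label list for a small footwear data set.
labels = ["sneaker"] * 120 + ["boot"] * 35 + ["sandal"] * 8

shortfalls = labels_needing_more_data(labels)
print(shortfalls)
```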


Clarifai’s Data Annotation Services

The process of data annotation can be difficult, tedious, time-consuming and prone to error. We are here to help with data annotation services to support the creation of your custom model. 

Our team can help you understand and interpret your goals with AI, and ensure that you have the high quality training data you need for success. We will preprocess your data, and deliver custom, verified labels in your file format of choice.

Find out more here!