May 19, 2023

Using the evaluations module to improve model performance

Navigating the intricate world of machine learning is no small feat, and measuring model performance is undoubtedly one of the most crucial aspects. As we embark on this journey together through the realm of data science, we find ourselves frequently pondering over questions like: How good is our model? Is it performing well, and if not, how can we improve it? And most importantly, how can we quantify its performance?

The answers to these questions lie in the careful use of performance metrics that help us evaluate the success of our predictive models. In this blog post, we aim to delve into some of these essential tools: ROC AUC, recall, precision, F1 score, confusion matrices, and finding problematic input images.

Receiver Operating Characteristic Area Under the Curve (ROC AUC) is a widely used metric for evaluating the trade-off between true positive rate and false positive rate across different thresholds. It helps us understand the overall performance of our model under varying circumstances.

Precision and recall are two incredibly informative metrics for problems where the data is unbalanced. Precision allows us to understand how many of the positive identifications were actually correct, while recall helps us to know how many actual positives were identified correctly.

Confusion matrices are a simple yet powerful visual representation of a model's performance. This tool helps us understand the true positives, true negatives, false positives, and false negatives, thereby providing an overview of a model's accuracy.

This post should serve as your guide to unravel the complexities of model performance and evaluation metrics, providing insights into how you can effectively measure and improve your machine learning models.

How to get to the model evaluations module

From the model's page, select "See versions table"

From the versions table, click the "Calculate" button

Once it's finished, click "View Results"

The evaluations module will load. You can also change the holdout set to evaluate on other datasets.

How can I improve model performance for specific concepts?

Concepts that perform well tend to be the ones that are annotated in images photographed in a consistent and unique way.

Concepts that tend to perform poorly are those:

  • a) trained on data with inconsistent compositions;
  • b) whose photos require outside context (relationships to people in portraits, etc.); and/or,
  • c) whose subject matter is subtle.

Keep in mind the model has no concept of language; so, in essence, “what you see is what you get.”

Let’s take a case of a false positive prediction made by a model in the process of training to recognize wedding imagery.

Here is an example of an image of a married couple, which had a false positive prediction for a person holding a bouquet of flowers, even though there is no bouquet in the photo.

What’s going on here?

[Image: concept performance]

A photo’s composition and the combination of elements therein could confuse a model.

All the images below were labeled with the ‘Bouquet_Floral_Holding’ concept.

[Image: images for concept performance]

In this very rare instance, the image in question has:

  • A veiled bride
  • The bride & groom kissing/their heads close together
  • Greenery over their heads
  • Large, recognizable flowers

The model sees the combination of all these individual things in lots of photos labeled ‘Bouquet_Floral_Holding’; and thus, that is the top result.

[Images: images for concept performance]

One way to fix this is to narrow the training data for ‘Bouquet_Floral_Holding’ to images in which the bouquet is the focal point, rather than any instance of the bouquet being held.

This way, the model can focus on the anchoring theme/object within the dataset more easily.

What is the ROC AUC score, and how does it relate to prediction accuracy?


[Image: closed-set and open-set recognition]

The table above is available on the model evaluation page in Clarifai’s legacy Explorer UI.

The ROC AUC (Concept Accuracy Score) is the concept’s prediction performance score, defined by the area under the Receiver Operating Characteristic curve. This score gives us an idea of how well we have separated our different classes, or concepts.

The ROC curve is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class. The AUC, or area under this curve, is (arguably) the best way to summarize a model’s performance in a single number.

You can think of AUC as representing the probability that a classifier will rank a randomly chosen positive observation higher than a randomly chosen negative observation, and thus it is a useful metric even for datasets with highly unbalanced classes.

A score of 1 represents a perfect model. A score of 0.5 represents a model that is no better than random guessing; such a model is not suitable for predictions and should be retrained.

Note that the ROC AUC is not dependent on the prediction threshold.
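The ranking interpretation above can be checked directly on a small example. The following is a minimal sketch, not the evaluation module's own implementation: the function name `roc_auc` and the labels and scores are made up for illustration, and ties between a positive and a negative are assumed to count as half a win.

```python
from itertools import product


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: 0/1 ground-truth values for one concept.
    scores: predicted probabilities for that concept, in the same order.
    Assumption: a tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for pos_score, neg_score in product(positives, negatives):
        if pos_score > neg_score:
            wins += 1.0
        elif pos_score == neg_score:
            wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truth and prediction scores for a single concept.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(f"ROC AUC: {auc:.3f}")
```

No threshold appears anywhere in the calculation, which is why the score does not depend on the prediction threshold you eventually choose.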

How do we read a concept-by-concept matrix?

[Image: concept-by-concept matrix]

The table above is available on the model evaluation page in Clarifai’s legacy Explorer UI.

A concept-by-concept matrix is a grid that shows, for the inputs labeled with each concept, which concepts the model predicted. This tool is another way of visualizing the performance of a model.

It allows us to review where we see true positives, or correctly predicted inputs (the diagonal cells). Simply put, this is an excellent tool for telling us where our model gets things right or wrong.

Each row represents the subset of the test set that was actually labeled as a concept, e.g., “dog.” As you go across the row, each cell shows the number of times those images were predicted as each concept, noted by the column name.
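To make the row and column layout concrete, here is a minimal sketch that builds such a matrix of counts. It assumes the simplest case, where each test input has exactly one actual label and one top predicted concept. The function name `concept_matrix` and the sample data are hypothetical.

```python
from collections import Counter


def concept_matrix(actual, predicted, concepts):
    """Return a matrix of counts: rows are actual concepts, columns are predicted.

    Assumption: each input has exactly one actual label and one top prediction.
    """
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in concepts] for a in concepts]


# Hypothetical test set: the actual label and top prediction for six images.
concepts = ["dog", "cat", "bird"]
actual = ["dog", "dog", "dog", "cat", "cat", "bird"]
predicted = ["dog", "dog", "cat", "cat", "dog", "bird"]

matrix = concept_matrix(actual, predicted, concepts)
for name, row in zip(concepts, matrix):
    print(f"{name:>5}: {row}")
```

Reading across the "dog" row, two of the three images labeled "dog" were predicted as "dog" and one was predicted as "cat".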

Along with AUC, what other insights can a confusion matrix provide?

  • Accuracy—Overall, how often is the model correct?
  • Misclassification Rate—Overall, how often is it wrong?
  • True Positive Rate—When it's actually yes, how often does it predict yes?
  • False Positive Rate—When it's actually no, how often does it predict yes?
  • Specificity—When it's actually no, how often does it predict no?
  • Precision—When it predicts yes, how often is it correct?
  • Prevalence—How often does the yes condition actually occur in our sample?
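Each of the measures in this list can be written in terms of the four counts for a single concept: true positives, false positives, true negatives, and false negatives. The sketch below uses made-up counts and a hypothetical helper named `binary_metrics`. It does not guard against empty denominators.

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive the listed measures from one concept's confusion counts.

    Assumption: every denominator is non-zero (no zero-division guard here).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,                # how often the model is correct
        "misclassification_rate": (fp + fn) / total,  # how often it is wrong
        "true_positive_rate": tp / (tp + fn),         # actual yes, predicted yes
        "false_positive_rate": fp / (fp + tn),        # actual no, predicted yes
        "specificity": tn / (tn + fp),                # actual no, predicted no
        "precision": tp / (tp + fp),                  # predicted yes, actually yes
        "prevalence": (tp + fn) / total,              # how often yes actually occurs
    }


# Hypothetical counts for one concept on a 100-image test set.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and the misclassification rate always sum to 1, as do specificity and the false positive rate.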

The diagonal cells represent True Positives, i.e., correctly predicted inputs. You’d want this number to be as close to the Total Labeled as possible.

Depending on how your model was trained, the off-diagonal cells could include both correct and incorrect predictions. In a non-mutually exclusive concepts environment, you can label an image with more than 1 concept.

For example, an image labeled as both “hamburger” and “sandwich” would be counted in both the “hamburger” row and the “sandwich” row. If the model correctly predicts this image to be both “hamburger” and “sandwich,” then this input will be counted in both on and off-diagonal cells.

[Image: concept-by-concept matrix]

The table above is available on the model evaluation page in Clarifai’s legacy Explorer UI.

This is a sample confusion matrix for a model. The Y-axis Actual Concepts are plotted against the X-axis Predicted Concepts. Each cell displays the average probability predicted for the column’s concept across the group of images that were labeled as the row’s concept.

The diagonal cells are the average probability for true positives, and any cells off the diagonal contain the average probability for non-true positives. From this confusion matrix, we can see that the concepts are distinct from one another, with a few areas of overlap, or clustering.

Concepts that co-occur, or are similar, may appear as a cluster on the matrix.
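A matrix of average prediction probabilities can be sketched as follows. This is an illustration under assumptions rather than the evaluation module's actual code: each test input is represented as a set of labeled concepts plus a dictionary of predicted probabilities, and the function name and data are hypothetical.

```python
def average_probability_matrix(inputs, concepts):
    """Rows: actual concept. Columns: predicted concept.

    Each cell is the mean predicted probability of the column concept over
    the inputs labeled with the row concept. Rows with no labeled inputs
    are filled with NaN.
    """
    matrix = []
    for actual in concepts:
        group = [probs for labels, probs in inputs if actual in labels]
        if group:
            row = [sum(p[predicted] for p in group) / len(group)
                   for predicted in concepts]
        else:
            row = [float("nan")] * len(concepts)
        matrix.append(row)
    return matrix


# Hypothetical inputs: (labeled concepts, predicted probability per concept).
concepts = ["hamburger", "sandwich", "salad"]
inputs = [
    ({"hamburger", "sandwich"}, {"hamburger": 0.9, "sandwich": 0.8, "salad": 0.1}),
    ({"hamburger"}, {"hamburger": 0.7, "sandwich": 0.6, "salad": 0.1}),
    ({"salad"}, {"hamburger": 0.1, "sandwich": 0.2, "salad": 0.9}),
]

avg_matrix = average_probability_matrix(inputs, concepts)
for name, row in zip(concepts, avg_matrix):
    print(f"{name:>9}: " + "  ".join(f"{value:.2f}" for value in row))
```

In this toy data, "hamburger" and "sandwich" score highly in each other's rows, which is the kind of cluster that similar or co-occurring concepts produce.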

How can I improve a model by drilling down to “problematic cells” in a confusion matrix?

[Image: problematic cells in a confusion matrix]

What is the importance of recall and precision rate?

Recall rate refers to the proportion of the images labeled as the concept that were predicted as the concept. It is calculated as True Positives divided by Total Labeled. Also known as “sensitivity” or “true positive rate.”

Precision rate refers to the proportion of the images predicted as a concept that had been actually labeled as the concept. It is calculated as True Positives divided by Total Predicted. Also known as “positive predictive value.”

You can think of precision and recall in the context of what we want to calibrate our model towards. Precision and recall typically trade off against each other: raising one tends to lower the other. Ultimately, the acceptable balance of false positives to false negatives is up to the client and depends on their goal.

We’re asking one of the following of our model:

  • That the guesses are correct, while missing some concepts (high precision);
  • That most instances of a concept are predicted as that concept, while having some wrong predictions (high recall).


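Here tp, fp, and fn are the counts of true positives, false positives, and false negatives.

Precision = tp ÷ (tp + fp)

When I guess X, my guess is correct, although I may miss other instances of X.

Recall = tp ÷ (tp + fn)

I identify every X as X, but occasionally predict other subjects that are not X as X.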
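Both rates can be computed for a single concept from two sets of inputs: those labeled as the concept and those predicted as the concept. The following is a minimal sketch. The function name, the image IDs, and the choice to return 0.0 when a set is empty are all assumptions made for illustration.

```python
def precision_recall(labeled_ids, predicted_ids):
    """Precision and recall for one concept.

    labeled_ids: inputs actually labeled as the concept (Total Labeled).
    predicted_ids: inputs the model predicted as the concept (Total Predicted).
    Assumption: a rate is reported as 0.0 when its denominator is empty.
    """
    labeled, predicted = set(labeled_ids), set(predicted_ids)
    true_positives = len(labeled & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall


# Hypothetical image IDs for a concept X.
labeled_as_x = {"img1", "img2", "img3", "img4"}
predicted_as_x = {"img1", "img2", "img5"}

precision, recall = precision_recall(labeled_as_x, predicted_as_x)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the three predictions are correct, so precision is about 0.67, while only two of the four labeled images were found, so recall is 0.50.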

How do we choose a prediction threshold?

A threshold is the cutoff score above which a prediction is counted as positive for a concept. The “sweet spot” depends on whether the objective of your prediction favors recall or precision. In practice, there are multiple ways to define “accuracy” when it comes to machine learning, and the threshold is the number we use to gauge our preferences.

You might be wondering how you should set your classification threshold, once you are ready to use it to predict out-of-sample data. This is more of a business decision, in that you have to decide whether you would rather minimize your false positive rate or maximize your true positive rate.

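If our model is used to predict concepts that lead to a high-stakes decision, like a diagnosis of a disease or moderation for safety, we might consider a few false positives as forgivable (better to be safe than sorry!). In this case, we might want high recall, since missing a true case costs more than raising a false alarm. Conversely, when a false positive is the costlier error, we would want high precision.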

If our model is used to predict concepts that lead to a suggestion or flexible outcome, we might want high recall so that the model can allow for exploration.

In either scenario, we will want to ensure our model is trained and tested with data that best reflects its use case.

Once we have determined the goal of our model (high precision or high recall), we can use test data that our model has never seen before to evaluate how well our model predicts according to the standards we have set.
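One way to see how the threshold shifts the balance is to compute precision and recall at several candidate thresholds on held-out test data. The sketch below is illustrative only. The labels and scores are made up, a score at or above the threshold is assumed to count as a positive prediction, and precision is reported as 1.0 by convention when nothing is predicted positive.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall for one concept at a given prediction threshold.

    Assumption: a score at or above the threshold is a positive prediction.
    Convention: precision is 1.0 when there are no positive predictions.
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical held-out test data for one concept.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.40, 0.70, 0.35, 0.20, 0.10]

results = {t: precision_recall_at(labels, scores, t) for t in (0.3, 0.5, 0.8)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.8 moves precision from about 0.67 to 1.00 while recall falls from 1.00 to 0.50.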

Once a model is trained and evaluated, how do we determine its accuracy?

The goal of any model is to get it to see the world as you see it.

In multi-class classification, accuracy is determined by the number of correct predictions divided by the total number of examples.

In binary classification, or for mutually exclusive classes, accuracy is determined by the number of true positives added to the number of true negatives, divided by the total number of examples.

Once we have established the goal we are working towards with the ground truth, we begin to assess the model’s predictions. What counts as accurate enough is subjective; most clients simply want to know that their models will perform to their standards once they are in the real world.

We begin by running a test set of images through the model and reading their precision and recall scores. The test set of images should be:

  • a) inputs that the model has not been trained with; and,
  • b) the same kind of data we would expect to see in the model’s particular use case.

Once we have our precision or recall scores, we will compare these to the model’s recall or precision thresholds of 0.5 and 0.8, respectively.