Welcome to the first AI in 5 post, where we show you how to create amazing things in just 5 minutes! This tutorial is based on this video, which is a step-by-step guide to using a large language model to build a text classification model.
Text classification is a common task in Natural Language Processing that assigns predefined categories to open-ended text. In this demonstration, we'll use Cohere AI's embedding model to capture semantic relationships and classify different types of test questions from the Student Questions dataset. While the dataset contains around 120,000 questions, we'll work with a smaller subset of 5,000 questions for simplicity.
Let's start by taking a look at the dataset we're working with. We'll be using the Student Questions dataset, which contains approximately 120,000 test questions. To keep the tutorial quick to follow, we'll narrow it down to about 5,000 of them.
The dataset is a structured CSV file with two columns: 'text' and 'label'. The 'text' column contains the question text, and the 'label' column contains the category of the question, which is one of four subjects: physics, chemistry, biology, or math.
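To get a feel for this layout, a quick label count is often useful before any conversion. The snippet below is a minimal sketch that assumes the column names described above; the sample rows are made up for illustration and are not taken from the actual dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the real CSV file (assumption).
SAMPLE_CSV = """text,label
What is the SI unit of force?,physics
Which gas is released during photosynthesis?,biology
What is the derivative of x squared?,math
What is the pH of pure water?,chemistry
What is Newton's second law?,physics
"""

def count_labels(csv_file):
    """Count how many questions fall under each label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)

# With a real file, pass open("your_file.csv", newline="") instead.
label_counts = count_labels(io.StringIO(SAMPLE_CSV))
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count}")
```

A heavily skewed count here would be a hint to look at per-class metrics later rather than a single overall score.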
Let’s start with data conversion, using a Python script to convert and prepare our dataset for classification. This script also helps us split the data into training and testing sets.
First, we need to specify whether any columns contain multiple values. In our dataset, none do, so our response will be 'no'.
Next, the script looks for the column with the text, which in this case is the first column. It can also recognize multiple categories in the second column, such as chemistry, math, biology, and physics. Also, it determines that there is only one more column besides the 'text' column, so it automatically selects 'labels' as the column for labels.
The Python tool then asks if we want to divide our dataset into a training set and a testing set. We agree with this and choose 'yes.' We make this choice because we want to see how well our model performs on new and unseen data.
There's no need to shuffle the data for this particular project, so we'll respond with 'no' when asked to do so. We also split the data into training and testing sets only, without a separate validation set.
With these responses, 80% of the dataset is dedicated to training and the remaining 20% to testing. The data is now neatly arranged into two distinct files: a training set and a testing set.
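The split itself comes down to a few lines of logic. The following is a rough sketch of an ordered 80/20 split, not the conversion script's actual code; `split_dataset` and the placeholder rows are hypothetical names introduced here for illustration.

```python
import random

def split_dataset(rows, train_fraction=0.8, shuffle=False, seed=0):
    """Split rows into (train, test); order is preserved unless shuffle=True."""
    rows = list(rows)
    if shuffle:
        # Seeded shuffle so the split is reproducible when shuffling is enabled.
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Hypothetical stand-in for the 5,000-question subset (assumption).
rows = [(f"question {i}", "physics") for i in range(5000)]

# shuffle=False mirrors the 'no' answer given to the shuffle prompt.
train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))  # 4000 1000
```

Skipping the shuffle only makes sense when the file is not sorted by label; otherwise the test set could end up missing whole categories.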
First, let’s create a new application. Sign up to Clarifai here and create a new App by specifying the App ID and Short Description and selecting the Base Workflow.
Here we have set the base workflow to Text, which is a single-model workflow that uses a text embedding model for general English text.
Now we have to change this to the Cohere text workflow. Go to the workflow section, copy the base workflow Text, rename the copy 'Text-Cohere', and change its Text Embedder from multilingual-text-embedding to the cohere-text-to-embeddings model.
Now save the workflow and go to the App settings to change the base workflow from Text to Text-Cohere.
Now let’s upload the Training and Test data. Go to the Inputs section in the Sidebar and click on upload to upload the training and testing data.
It takes a while to upload the data because every text input you upload is passed through the Cohere embedding model.
Once the training data is uploaded, select all the search results, add a new dataset named train, and then click on Add inputs. This ensures that all the uploaded data is under the dataset named train.
Follow similar steps to upload the test dataset.
Now, let’s train our text classification model. First, go to the Models section in your application and select the Create Model option in the top-right corner.
Now select the Transfer Learning Classifier option.
Now specify the Model ID, choose the 'train' dataset, select all the concepts (labels) in the training dataset, and hit Train.
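Conceptually, transfer learning here means the embedding model stays fixed and only a lightweight classifier is fit on top of the vectors it produces. The sketch below illustrates that idea with a nearest-centroid classifier over cosine similarity. It is a swapped-in stand-in, not Clarifai's actual training algorithm, and the 3-dimensional vectors are hand-made toy values rather than real Cohere embeddings.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def train(embeddings, labels):
    """'Training' = one centroid per label over the fixed embeddings."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {label: centroid(vectors) for label, vectors in grouped.items()}

def predict(model, embedding):
    """Return the label whose centroid is most similar to the embedding."""
    return max(model, key=lambda label: cosine(model[label], embedding))

# Toy vectors standing in for precomputed embeddings (assumption).
train_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
                    [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]
train_labels = ["physics", "physics", "biology", "biology"]

model = train(train_embeddings, train_labels)
prediction = predict(model, [0.7, 0.3, 0.0])
print(prediction)  # physics
```

Because the embeddings are computed once at upload time, fitting a small classifier like this is fast compared with training a language model from scratch.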
Once training is done, we evaluate the model's performance on both the training and testing datasets. Here are the results on the training dataset.
Since the model was trained on this dataset, it achieves high scores for ROC/AUC, Precision, Recall, and F1 Score.
On the test data, which contains examples the model hasn't seen before, it still performs well, given the limited subset used for training.
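For reference, precision, recall, and F1 can be computed per class from a list of true and predicted labels. The following sketch uses the standard definitions; the label lists are invented for illustration and are not the results from this tutorial.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels for illustration only (assumption).
y_true = ["physics", "physics", "math", "math", "biology", "chemistry"]
y_pred = ["physics", "math", "math", "math", "biology", "chemistry"]

metrics = {label: per_class_metrics(y_true, y_pred, label)
           for label in sorted(set(y_true))}
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example one physics question is misclassified as math, which lowers recall for physics and precision for math while leaving the other two classes untouched.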
And that's how to use Clarifai's platform to train a text classification model with Cohere AI's embedding model on a text dataset. We've shown data preprocessing, model creation, training, and performance evaluation. Thanks for reading!