One topic in AI and Machine Learning that can be confusing is the difference between supervised and unsupervised learning, and why we would ever want to use an unsupervised system. In this blog post, I’ll discuss these learning paradigms so that you can get started. Let’s begin!
What’s the difference between Supervision levels?
To explain the differences between these levels of supervision, I’m going to use a thought experiment. Let's say that I love cooking with mushrooms, and I live in the countryside where there are many edible, delicious kinds of mushrooms I can eat. The problem is that some of the mushrooms are poisonous, and they’ll make me sick if I eat them. I know that the poisonous ones look different, but I don’t quite know what they look like.
Supervised Learning is the class of algorithms that use a label associated with data instances to learn. My friend Mario is a Subject Matter Expert (SME) in mushrooms; he can tell me which ones are good for me, and which ones will ruin my day. The problem is that Mario’s expertise isn’t free; he wants me to pay him to tell me if a mushroom is poisonous or not. Since I know nothing about mushrooms, I’ll have to spend a lot of money getting him to label my mushrooms before I have a good model in my head of what a poisonous mushroom looks like, and what a healthy mushroom looks like. I think that it will take about 20 examples of poisonous mushrooms, and 20 examples of safe mushrooms to eat before I feel confident doing this task without Mario by my side. I worry that I won’t be able to afford Mario’s expensive data labeling fees, but also worry that if I don’t pay, I’ll have a bad model and will end up having unpleasant dinners more than I would like. Common types of Supervised Learning algorithms are Decision Trees, Support Vector Machines, and Artificial Neural Networks.
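To make the idea of learning from Mario's labels concrete, here is a minimal sketch of a "decision stump", which is a decision tree with a single split. Everything in it is an assumption made for illustration: the two features (height and redness), the eight invented mushrooms, and the helper names `fit_stump` and `predict`. It is not real mycology, and a real project would more likely use a library implementation.

```python
from typing import List, Tuple

Mushroom = Tuple[float, float]      # (height_inches, redness from 0 to 1)
Stump = Tuple[int, float, bool]     # (feature index, threshold, label when above)


def predict(stump: Stump, x: Mushroom) -> bool:
    """Return True if the stump thinks mushroom x is poisonous."""
    feature, threshold, label_above = stump
    return label_above if x[feature] > threshold else not label_above


def fit_stump(X: List[Mushroom], y: List[bool]) -> Stump:
    """Try every feature/threshold split and keep the one with the fewest mistakes."""
    best_stump, best_errors = None, len(X) + 1
    for feature in range(len(X[0])):
        values = sorted({x[feature] for x in X})
        # Candidate thresholds sit halfway between neighbouring feature values.
        for low, high in zip(values, values[1:]):
            for label_above in (True, False):
                stump = (feature, (low + high) / 2, label_above)
                errors = sum(predict(stump, x) != label for x, label in zip(X, y))
                if errors < best_errors:
                    best_stump, best_errors = stump, errors
    return best_stump


# Invented examples "labeled by Mario": True means poisonous.
X = [(4.0, 0.9), (3.5, 0.8), (4.2, 0.7), (3.8, 0.95),
     (2.0, 0.85), (1.5, 0.2), (2.5, 0.6), (2.2, 0.1)]
y = [True, True, True, True, False, False, False, False]

stump = fit_stump(X, y)
print(stump)                        # the learned rule
print(predict(stump, (3.9, 0.5)))   # a new, tall mushroom
```

The key point is that `fit_stump` cannot run without `y`: every example needs a label, and in the story every label costs money.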
Unsupervised Learning is the class of algorithms that do not require labeled data to learn. This is the situation I described at the start: there are mushrooms, and I can see the features of each mushroom (height, shape, color), but I don’t know if it’s poisonous or not unless I eat it. What I can do, however, is start learning the different types of mushrooms and the combinations of features that go together. For instance, maybe there is a common theme of red mushrooms that are taller than 3 inches, and another of green mushrooms with a very flat cap. Without knowing whether these mushrooms are poisonous or not, I’m able to make groups, or clusters, of similar mushrooms (clustering is one of the more common types of unsupervised learning). An advantage of this is that I don’t need labeled data to get started. I can quickly (and cheaply, since I don't need to pay an expert!) become very good at sorting mushrooms and grouping them together with their own kind; when I see a new mushroom I’ll be able to instantly say “Aha! This looks like these that I’ve seen before.” A disadvantage of this method is that I still don’t know anything about which mushrooms are safe. I was able to take 100 mushrooms and quickly divide them into 10 different types, but I don’t know which types belong in my pasta and which types belong in the trash. Common types of Unsupervised Learning algorithms include Clustering, Principal Component Analysis (PCA), and Singular Value Decomposition (SVD).
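The grouping step can be sketched with a tiny k-means clustering routine. This is only an illustration under stated assumptions: the two features (height and cap flatness), the six mushrooms, and the helper names `nearest` and `kmeans` are all invented, and the deterministic "farthest point" starting rule is just one simple choice among many. It needs Python 3.8 or later for `math.dist`.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def nearest(p: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))


def kmeans(points: List[Point], k: int, iters: int = 20) -> List[int]:
    """Group points into k clusters and return one cluster id per point."""
    # Deterministic start: first point, then repeatedly the point farthest
    # from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move each centroid to the mean of its members (keep it if empty).
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:    # nothing moved, so we have converged
            break
        centroids = updated
    return [nearest(p, centroids) for p in points]


# Invented mushrooms: (height_inches, cap_flatness from 0 to 1). No labels!
mushrooms = [(4.0, 0.10), (3.6, 0.20), (4.3, 0.15),
             (1.0, 0.90), (1.2, 0.80), (0.9, 0.95)]

labels = kmeans(mushrooms, k=2)
print(labels)   # one cluster id per mushroom
```

Notice that the cluster ids are just numbers. Nothing in the output says which group is safe to eat, which is exactly the limitation described above.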
As we saw above, if I use a supervised approach, I’ll get a good-quality model and be able to determine whether a mushroom is toxic, but I'll need a lot of high-quality labeled data. On the other hand, if I use an unsupervised approach, I’ll be able to quickly make logical groupings of mushrooms without any labeled data, but I won’t know if any of them are safe. Fortunately, there’s an in-between. Semi-supervised Learning, as I’ll use the term here, is a two-tiered approach; we initialize our model in an unsupervised manner, and then we fine-tune it using a small amount of labeled data. In our unsupervised example, I was able to make 10 clusters of mushrooms that had similar features; what if I only paid Mario to label 10 of my mushrooms, one from each cluster, instead of 40? Well now, when I get a new mushroom, I can assign it to one of my 10 groups (because I know a lot about what mushrooms look like!), and then I can deduce “Mario said that one mushroom in this cluster was safe, and the rest are very similar. I think they will all be safe!” Mario is less happy because he didn’t get as much business from me, but I’m able to eat safely at one quarter of the original cost!
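The "one label per cluster" reasoning can be written down in a few lines. This sketch assumes the clustering has already been done and that each mushroom already has a cluster id; the function name `propagate_labels`, the cluster ids, and the two expert labels are all hypothetical. When a cluster has several expert labels, the sketch takes a majority vote, which is one simple choice rather than the only one.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def propagate_labels(cluster_ids: List[int],
                     expert_labels: Dict[int, str]) -> List[Optional[str]]:
    """Give every mushroom the majority expert label of its cluster.

    cluster_ids[i] is the cluster of mushroom i (from an unsupervised step).
    expert_labels maps the index of a mushroom Mario looked at to his verdict.
    Mushrooms in a cluster Mario never saw get None.
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for index, verdict in expert_labels.items():
        votes[cluster_ids[index]][verdict] += 1
    cluster_verdict = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [cluster_verdict.get(c) for c in cluster_ids]


# Eight mushrooms in three clusters; Mario only labeled mushrooms 0 and 4.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
expert_labels = {0: "safe", 4: "poisonous"}

guesses = propagate_labels(cluster_ids, expert_labels)
print(guesses)

# The cost saving from the story: 10 paid labels instead of 40.
print(10 / 40)
```

The `None` entries are a useful reminder: a cluster with no expert label is still a mystery, and a single label per cluster is only trustworthy if the clusters really do separate safe mushrooms from poisonous ones.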
In practice, semi-supervised learning can often reduce the number of labeled training examples you need to produce a model with the same accuracy as a purely supervised one. Labeled data can be expensive, so when plenty of unlabeled data is available, semi-supervised learning is often the most cost-effective way to train a model.
How are these used in the Real World?
Here I’ll explore how I’ve used semi-supervised learning in my research over the years.
EEG Case Study
This is a summary of the methodology used in two of the author’s papers: 1, 2
When I was in graduate school for my master’s, I was working on a seizure detection problem with data from the Boston Children’s Hospital. An EEG is used to monitor electrical brain activity, and a skilled medical professional can look at an EEG signal and tell whether the child wearing the EEG on their head is having a seizure. The problem is that it can be difficult for a computer to determine whether the person is having a seizure from electrical signals alone (in this experiment there were 23 EEG channels sampling at 256 Hz, for a total of 5,888 samples every second). It requires special training and experience to look at an EEG signal and make the determination, so labeled data is hard to come by. What is not difficult to obtain is unlabeled EEG data; you can just record data as the EEG runs and learn the features of the signal (like peak frequency and peak variation) so that instances of the EEG that are similar can be clustered. By training an autoencoder (unsupervised) on days of EEG data, and then fine-tuning it by using a support vector machine (supervised), I was able to classify seizures more accurately than if I had used only labeled data. I also needed less labeled data to achieve the same accuracy.
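To get a feel for how quickly unlabeled EEG data piles up, the data rate quoted above can be worked out directly. The channel count and sampling rate come from the experiment described here; the per-day figure is my own extrapolation and assumes the recording runs continuously for 24 hours.

```python
CHANNELS = 23            # EEG channels in the experiment
SAMPLE_RATE_HZ = 256     # samples per channel per second
SECONDS_PER_DAY = 24 * 60 * 60

samples_per_second = CHANNELS * SAMPLE_RATE_HZ
# Assumption: one full day of uninterrupted recording.
samples_per_day = samples_per_second * SECONDS_PER_DAY

print(samples_per_second)   # values recorded each second across all channels
print(samples_per_day)      # values recorded in one continuous day
```

That works out to roughly half a billion values per day, none of which needs an expert to label before an unsupervised model can learn from it.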
Supervised Learning on a Budget
If you’ve heard of computer vision and deep learning, you probably know that the largest criticism of them is that they require an enormous amount of labeled data to be effective. That is true. Further, as I said above, using an SME to label all that data for you can be very expensive and time-consuming, depending on your use case. While semi-supervised learning offers companies one way around this problem, working with an AI provider is another option you should consider. Clarifai, for instance, already has an enormous amount of labeled data that you can leverage to build a custom supervised model with only a small amount of labeled data of your own. That means you can get the benefits of expertly labeled data, i.e. good computer vision models, without breaking the bank.