Previously, we learned how to use supervised learning to make predictions based on labelled data. In this lesson, we'll cover a new topic called clustering.
What is clustering?
Clustering is a family of machine learning algorithms that divide data into groups, called clusters. Clustering can help us see patterns in messy datasets. Machine learning scientists use clustering to divide customers into segments, images into categories, or behaviours into typical and anomalous.
Supervised versus Unsupervised learning
Clustering is part of a broader category within machine learning called "unsupervised learning".
Unsupervised learning differs from supervised learning in the structure of the training data: supervised learning uses data with both features and labels, while unsupervised learning uses data with only features.
This makes unsupervised learning, and clustering, particularly appealing: you can use it even when you don't know much about your dataset.
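To make the difference between the two kinds of training data concrete, here is a minimal sketch. The numbers and the label names ("small", "large") are invented purely for illustration and do not come from any real dataset.

```python
# Hypothetical toy dataset: each row holds two numeric features.
features = [
    [1.0, 2.0],
    [1.2, 1.8],
    [8.0, 9.0],
    [8.5, 9.5],
]

# Supervised learning: every row of features is paired with a known label.
labels = ["small", "small", "large", "large"]
supervised_data = list(zip(features, labels))

# Unsupervised learning: the same rows, but with no labels attached.
unsupervised_data = features

for row, label in supervised_data:
    print(f"supervised:   features={row}, label={label}")
for row in unsupervised_data:
    print(f"unsupervised: features={row}")
```

A clustering algorithm would receive only something like unsupervised_data and would have to discover the groups on its own.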
Case: Customer Segmentation
Let's dive into an example of clustering for customer segmentation. Customer segmentation is the process of dividing a pool of customers into different groups with common attributes.
We can use these segments to devise targeted advertising campaigns or to explain otherwise confusing results by analysing the behaviour of individual segments, rather than just looking at the customers as a whole.
First, we need to brainstorm a list of features that will accurately describe our customers. Let's say we are working for an airline and our customers are travellers. Important features might include the number of flights they've taken in the past year, the percentage of those flights that were international, how far in advance they typically buy tickets, and the percentage of their tickets that were business class.
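One way to picture these features is as a row of numbers per traveller. The sketch below is only an illustration: the class name Traveller, the field names, and the three sample travellers are all hypothetical.

```python
from dataclasses import dataclass, astuple


@dataclass
class Traveller:
    """One customer, described by the four features discussed above."""
    flights_last_year: int       # number of flights in the past year
    pct_international: float     # percentage of those flights that were international
    days_booked_ahead: float     # typical number of days between purchase and flight
    pct_business_class: float    # percentage of tickets that were business class


# Invented sample customers, for illustration only.
travellers = [
    Traveller(42, 60.0, 5.0, 80.0),
    Traveller(2, 0.0, 90.0, 0.0),
    Traveller(6, 100.0, 30.0, 0.0),
]

# Clustering algorithms typically expect a matrix: one row per customer,
# one column per feature.
feature_matrix = [list(astuple(t)) for t in travellers]

for row in feature_matrix:
    print(row)
```

Each row of feature_matrix is what a clustering algorithm would see for one customer. In practice, features measured on very different scales are often rescaled before clustering, which this sketch leaves out.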
Some clustering algorithms need us to define how many clusters we want to create. The number of clusters we ask for greatly affects how the algorithm will segment our data.
Here's an example of some of our flights data. The x-axis represents the number of flights per year and the y-axis represents the percentage of those flights that were business class.
Here's how the algorithm divides the data if we ask for two clusters.
And here is how it divides the same data if we ask for three clusters.
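The effect of the requested number of clusters can be sketched in code. The lesson does not say which algorithm produced the images, so the example below assumes k-means, one common algorithm that requires the number of clusters up front. The function k_means, its simple deterministic starting points, and the nine data points are all hypothetical and are not the data shown in the images.

```python
import math


def k_means(points, k, iterations=20):
    """Minimal k-means sketch: returns one cluster index per point."""
    # Deterministic start (an assumption made for reproducibility):
    # take k points spread evenly through the sorted data.
    ordered = sorted(points)
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centroids = [ordered[round(i * step)] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments


# Invented data: (flights per year, percentage of flights in business class).
points = [
    (2, 0), (3, 5), (4, 0),          # few flights, almost no business class
    (10, 5), (12, 10), (11, 0),      # more flights, little business class
    (40, 85), (45, 90), (38, 80),    # many flights, mostly business class
]

print("k=2:", k_means(points, 2))
print("k=3:", k_means(points, 3))
```

With this toy data, asking for two clusters lumps the first six travellers together, while asking for three separates them into two groups of three. Real implementations usually choose starting centroids randomly and rescale the features first.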
Having a strong hypothesis about our data helps us get better results from the clustering algorithm. For our airline example, we might expect to have business travellers, family travellers, and adventurers: three clusters, as in the image.