Unsupervised Learning
Unsupervised learning is a type of machine learning where the algorithm is trained on a dataset without any labeled output. Unlike supervised learning, there is no specific target variable to predict or classify. The algorithm is instead given a dataset and is tasked with finding patterns, similarities, and differences among the data points.
In unsupervised learning, the algorithm tries to identify the underlying structure of the data by clustering similar data points together or reducing the dimensionality of the dataset. Common techniques include clustering algorithms such as K-means and hierarchical clustering, and dimensionality reduction techniques such as principal component analysis (PCA) and singular value decomposition (SVD).
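As a concrete illustration of dimensionality reduction, here is a minimal PCA sketch using NumPy's SVD; the random data and the choice of k = 2 components are illustrative assumptions, not from the text:

import numpy as np

def pca(X, k):
    # Center each feature at zero, then take the top-k right singular vectors
    # as the principal directions and project the data onto them.
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T  # coordinates in the top-k component space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 points, 5 features (toy data)
X_reduced = pca(X, k=2)        # reduce to 2 dimensions
print(X_reduced.shape)         # (100, 2)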
Unsupervised learning is often used for exploratory data analysis, data preprocessing, and feature extraction. It can also be useful in identifying anomalies or outliers in a dataset. However, since there is no predefined output to measure the accuracy of the algorithm, evaluating the effectiveness of an unsupervised learning algorithm can be challenging.
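As a sketch of the outlier idea, assuming cluster labels and centroids have already been computed (for instance by the K-means code later in this section), points unusually far from their own centroid can be flagged; the function name and the threshold value here are arbitrary illustrative choices:

import numpy as np

def flag_outliers(X, labels, centroids, threshold=3.0):
    X = np.asarray(X, dtype=float)
    # Distance from each point to the centroid of its own cluster.
    dists = np.linalg.norm(X - np.asarray(centroids)[labels], axis=1)
    return dists > threshold  # True marks a potential outlier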
One example of unsupervised learning is clustering, where the goal is to group similar data points together based on their features. A commonly used clustering algorithm is K-means clustering.
Suppose we have a dataset of n data points, each with m features, represented as an n x m matrix X. We want to group these data points into k clusters based on their feature similarity. Here's how K-means clustering works (a code sketch of these steps follows the list):
1. Initialize k cluster centroids randomly.
2. Assign each data point to the nearest centroid based on the Euclidean distance between the data point and the centroid.
3. Recalculate the centroid of each cluster as the mean of the feature values of the data points assigned to that cluster.
4. Repeat steps 2 and 3 until the centroids no longer change or a maximum number of iterations is reached.
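Here is a minimal NumPy sketch of these four steps; the function name, the random-data-point initialization, and the default parameters are illustrative choices, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids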
Here's an example using a toy dataset with 6 data points and 2 features:
X = [[1, 4],
[2, 2],
[2, 5],
[5, 1],
[6, 2],
[7, 3]]
Suppose we want to cluster these data points into k=2 clusters. We can initialize the centroids randomly:
centroid1 = [2, 2]
centroid2 = [6, 3]
We then assign each data point to the nearest centroid. For example, [1, 4] is at distance sqrt((1-2)^2 + (4-2)^2) = sqrt(5) ≈ 2.24 from centroid1 and sqrt((1-6)^2 + (4-3)^2) = sqrt(26) ≈ 5.10 from centroid2, so it joins cluster 1. Repeating this for every point gives:
Cluster 1: [1, 4], [2, 2], [2, 5]
Cluster 2: [5, 1], [6, 2], [7, 3]
We recalculate each centroid as the mean of its cluster's points:
centroid1 = [(1+2+2)/3, (4+2+5)/3] = [1.67, 3.67]
centroid2 = [(5+6+7)/3, (1+2+3)/3] = [6, 2]
We reassign the data points to the updated centroids:
Cluster 1: [1, 4], [2, 2], [2, 5]
Cluster 2: [5, 1], [6, 2], [7, 3]
The assignments are unchanged, so recalculating the centroids leaves them at [1.67, 3.67] and [6, 2]. The centroids no longer change, so we stop. The final clusters are:
Cluster 1: [1, 4], [2, 2], [2, 5]
Cluster 2: [5, 1], [6, 2], [7, 3]
In this example, K-means clustering was used to group similar data points together based on their features without any labeled output.
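The walkthrough can be reproduced with the kmeans sketch above; since the initialization is random, the cluster indices may come out swapped, but the grouping is the same:

X = [[1, 4], [2, 2], [2, 5], [5, 1], [6, 2], [7, 3]]
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1]: the two groups from the walkthrough
print(centroids)  # approximately [[1.67, 3.67], [6.0, 2.0]]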