Exercises - Basics of Clustering with K-Means and GMM
Basics of Python for Data Science
From raw data to cluster profiles
Let’s do some basic unsupervised machine learning / clustering analysis. We will use K-Means and Gaussian Mixture Models (GMM) to group participants based on their standardized test scores or questionnaire responses.
Scenario
You must analyze a dataset containing some numerical scores per participant. Your goal is to learn how to run k-means clustering and GMM to group participants (i.e., multivariate observations) based on their features, without any prior labels. You could do it with any dataset available online. Below, the exercise is done using the final output (simulated) dataset from the “data nightmare” exercise (you can download it from here: dataForCluster.csv).
Your Final Goals
- Run clustering with a different number k of components;
- Decide the best number of clusters using appropriate metrics (e.g., silhouette values for K-Means; BIC for GMM);
- Compare clustering results between K-Means and GMM;
- Inspect the average profiles of resulting clusters with data visualization.
Script
Preliminary steps
Import packages

import pandas as pd
import numpy as np
import sklearn as skl # includes both k-means and GMM methods
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization
Import data
dfTot = pd.read_csv("data/dataForCluster.csv")
Take only numeric columns, and drop rows with any missing value
dfNum = dfTot.select_dtypes(include=np.number).dropna()
K-Means
For k-means, it’s essential that features have the same variance (otherwise some features might be weighted enormously more than others in the clustering… but try to skip this step and see what changes, if you like 🙂). Additionally, features should ideally be orthogonal, otherwise k-means tends to group observations along correlated features. So, a preliminary PCA would be warranted (a minimal sketch is shown after the next chunk)! Here we skip it to avoid making this too long, but know that in real research this would be important!
# Standardize all numeric features (mean=0, std=1)
from sklearn.preprocessing import StandardScaler
dfNumStd = StandardScaler().fit_transform(dfNum)
dfNumStd = pd.DataFrame(dfNumStd)
dfNumStd.columns = dfNum.columns
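As noted above, in real research a preliminary PCA could be used to obtain orthogonal features before K-Means. Here is a minimal sketch of what it could look like (not used in the rest of this exercise; pca and dfNumPCA are just illustrative names):

# Optional: project the standardized data onto orthogonal principal components
from sklearn.decomposition import PCA
pca = PCA() # keep all components; you could instead retain only the first few
dfNumPCA = pd.DataFrame(pca.fit_transform(dfNumStd))
print(pca.explained_variance_ratio_) # proportion of variance explained by each component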
Compute silhouette scores for various numbers of components k
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
Ks = range(2, 7) # from 2 to 6
silhouetteValues = [] # initialize list of results
for k in Ks:
    # create k-means object
    kmeans = KMeans(n_clusters=k)
    # predict classification using data
    classification = kmeans.fit_predict(dfNumStd)
    # compute and store silhouette score for k
    silhouetteValues.append(silhouette_score(dfNumStd, classification))
# Plot silhouette scores
plt.figure(figsize=(11,6))
sns.pointplot(x=list(Ks), y=silhouetteValues, scale=1.5)
plt.title("Silhouette scores vs k", fontsize=20)
plt.xlabel("k", fontsize=18)
plt.ylabel("Silhouette score", fontsize=18)
plt.xticks(fontsize=16);
plt.yticks(fontsize=16);
plt.show(); plt.close()
- According to this plot, what is the most appropriate number of clusters?
- Note that the silhouette score cannot be computed for k = 1, which is a relevant limitation when using k-means, because it leaves it unclear whether the “just one population” solution is actually best; alternatives could be computing the Duda-Hart test under some assumptions, or computing additional statistics, e.g., the within-cluster sum of squares (WCSS, the .inertia_ attribute), and inspecting the plot with the “elbow method” (see the sketch below).
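As a minimal sketch of that elbow inspection (using the same range of k values plus k = 1; wcss is just an illustrative name):

# Elbow method: within-cluster sum of squares (WCSS, .inertia_) for each k, including k = 1
wcss = []
for k in range(1, 7):
    wcss.append(KMeans(n_clusters=k).fit(dfNumStd).inertia_)

sns.pointplot(x=list(range(1, 7)), y=wcss)
plt.title("WCSS vs k (elbow method)")
plt.xlabel("k")
plt.ylabel("WCSS (inertia)")
plt.show(); plt.close()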
Plot cluster profiles when k = 2
# Fit K-Means with k = 2 and take classifications/labels
kmeans = KMeans(n_clusters=2)
classification = kmeans.fit_predict(dfNumStd)

# Compute mean profile of each cluster
clusterMeans = pd.DataFrame(dfNumStd).groupby(classification).mean()
clusterMeans.columns = dfNumStd.columns
# Plot average standardized scores for each cluster (".T" is for transposing)
sns.lineplot(data=clusterMeans.T, markers=True, markersize=10)
plt.title("Cluster profiles (KMeans, k=2)", fontsize=18)
plt.ylabel("z-score", fontsize=16)
plt.xlabel("", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
Gaussian Mixture Models
GMM is more flexible than K-Means. It does not assume equal variances across clusters and can model (because it is actually a model) correlations between features (i.e., full covariance structures). It can also be fitted with k = 1, so you can formally compare models with different numbers of components, starting from one, using criteria like BIC or AIC. Its main limitation is the assumption that data within each cluster follow a Gaussian distribution.
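For instance, the shape of the within-cluster covariances is controlled by the covariance_type argument of scikit-learn’s GaussianMixture ("full", "tied", "diag", "spherical"). A minimal sketch comparing these options by BIC, with the number of components arbitrarily fixed at 2 (gmmTry is just an illustrative name):

# Compare covariance structures by BIC, with the number of components fixed at 2
from sklearn.mixture import GaussianMixture
for covType in ["full", "tied", "diag", "spherical"]:
    gmmTry = GaussianMixture(n_components=2, covariance_type=covType).fit(dfNumStd)
    print(covType, round(gmmTry.bic(dfNumStd), 1))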
Compute BIC values for various numbers of components k
from sklearn.mixture import GaussianMixture
BICs = []
Ks = range(1, 7)
for k in Ks:
    # create and fit GMM object
    gmm = GaussianMixture(n_components=k).fit(dfNumStd)
    # extract and store BIC
    BICs.append(gmm.bic(dfNumStd))
# Plot BIC values
sns.pointplot(x=list(Ks), y=BICs, scale=1.5)
plt.title("BIC values for different GMMs", fontsize=18)
plt.xlabel("Number of Components", fontsize=16)
plt.ylabel("BIC", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
Plot cluster profiles when k = 2
gmm = GaussianMixture(n_components=2).fit(dfNumStd)
clusterMeans = pd.DataFrame(gmm.means_)
clusterMeans.columns = dfNumStd.columns

sns.lineplot(data=clusterMeans.T, markers=True, markersize=10)
plt.title("Cluster profiles (GMM, k=2)", fontsize=18)
plt.ylabel("z-score", fontsize=16)
plt.xlabel("", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
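One of the goals above was to compare K-Means and GMM results. A minimal sketch (assuming the K-Means classification with k = 2 from the earlier chunk is still in memory; gmmClassification is just an illustrative name) cross-tabulates the two sets of labels and computes their adjusted Rand index:

# Compare K-Means (k = 2) and GMM (k = 2) cluster assignments
from sklearn.metrics import adjusted_rand_score
gmmClassification = gmm.predict(dfNumStd)
print(pd.crosstab(classification, gmmClassification)) # how often the two labelings agree
print(adjusted_rand_score(classification, gmmClassification)) # 1 = identical partitions (label order ignored)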
Plot cluster profiles when k = 3
gmm = GaussianMixture(n_components=3).fit(dfNumStd)
clusterMeans = pd.DataFrame(gmm.means_)
clusterMeans.columns = dfNumStd.columns

sns.lineplot(data=clusterMeans.T, markers=True, markersize=10)
plt.title("Cluster profiles (GMM, k=3)", fontsize=18)
plt.ylabel("z-score", fontsize=16)
plt.xlabel("", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
Extra: Dimensionality reduction and 2-D scatter plots
The following chunk runs a GMM with k = 3 clusters, then runs a PCA to reduce the data to two dimensions for visualization, and finally shows the GMM-predicted clusters in a 2-D scatterplot.
# Fit GMM with 3 components
gmm = GaussianMixture(n_components=3).fit(dfNumStd)
clusters = gmm.predict(dfNumStd).astype(str)

# Reduce data to 2-D using PCA for plotting
from sklearn.decomposition import PCA
coordinates = pd.DataFrame(PCA(n_components=2).fit_transform(dfNumStd))
# Plot PCA projection as 2-D scatterplot
sns.scatterplot(x=coordinates[0], y=coordinates[1], hue=clusters, s=70, alpha=.8)
plt.title("PCA 2D projection of GMM clusters", fontsize=18)
plt.xlabel("PCA Component 1", fontsize=16)
plt.ylabel("PCA Component 2", fontsize=16)
plt.xticks(fontsize=13);
plt.yticks(fontsize=13);
plt.legend(fontsize=12)
plt.show(); plt.close()
What could you do, now?
- Repeat using your own dataset, or a dataset publicly available from OSF, Zenodo, Figshare, or elsewhere.
- Repeat k-means, but first perform PCA on the dataset to ensure you use orthogonal features.
- Generate data (e.g., with np.random.normal()) to simulate different clusters, and see whether the clustering methods return what you expect (a minimal sketch is given below).
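As a minimal sketch of that last point (all numbers below are arbitrary): simulate two groups with different means on four variables, stack them, and check whether the BIC favors two components:

# Simulate two groups of 100 "participants" on 4 variables, differing in their means
np.random.seed(0)
groupA = np.random.normal(loc=0.0, scale=1.0, size=(100, 4))
groupB = np.random.normal(loc=2.0, scale=1.0, size=(100, 4))
simData = np.vstack([groupA, groupB])

# BIC should be lowest around k = 2 if the separation is large enough
for k in range(1, 5):
    print(k, round(GaussianMixture(n_components=k).fit(simData).bic(simData), 1))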
Clustering techniques such as K-Means and Gaussian Mixture Models (GMM) will always assign data points to clusters, even when no true underlying structure exists, which can create an illusion of meaningful subgroups (a small demonstration is sketched at the end of this page).
Avoid interpreting clusters as real psychological types unless you have very strong theoretical and empirical justification.
Personally, I believe that in virtually no case can clustering meaningfully reveal any real underlying “category” when using behavioral data; see, e.g., here and here.
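To see the point about clustering always returning clusters, a minimal sketch (arbitrary numbers): run K-Means on pure, unstructured noise; it still partitions the data, and the silhouette score is typically modest but above zero:

# K-Means will happily partition pure noise into 3 "clusters"
np.random.seed(1)
noise = np.random.normal(size=(200, 4)) # no real group structure at all
noiseLabels = KMeans(n_clusters=3).fit_predict(noise)
print(pd.Series(noiseLabels).value_counts()) # every point gets assigned to some cluster
print(silhouette_score(noise, noiseLabels)) # typically modest, but above zero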