Exercises - Basics of Clustering with K-Means and GMM
Basics of Python for Data Science
From raw data to cluster profiles
Let’s do some basic unsupervised machine learning / clustering analysis. We will use K-Means and Gaussian Mixture Models (GMM) to group participants based on their standardized test scores or questionnaire responses.
Scenario
You must analyze a dataset containing some numerical scores per participant. Your goal is to learn how to run k-means clustering and GMM to group participants (i.e., multivariate observations) based on their features, without any prior labels. You could do it with any dataset available online. Below, the exercise is done using the final output (simulated) dataset from the “data nightmare” exercise (you can download it from here: dataForCluster.csv).
Your Final Goals
- Run clustering with a different number k of components;
- Decide the best number of clusters using appropriate metrics (e.g., silhouette values for K-Means; BIC for GMM);
- Compare clustering results between K-Means and GMM;
- Inspect the average profiles of resulting clusters with data visualization.
Script
Preliminary steps
Import packages

import pandas as pd
import numpy as np
import sklearn as skl # includes both k-means and GMM methods
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization
Import data
dfTot = pd.read_csv("data/dataForCluster.csv")
Take only numeric columns, and drop rows with any missing value
dfNum = dfTot.select_dtypes(include=np.number).dropna()
K-Means
For k-means, it’s essential that features have the same variance (otherwise some features might be weighted enormously more than others in the clustering… but try to skip this step and see what changes, if you like 🙂). Additionally, features should ideally be orthogonal, otherwise k-means tends to group observations along correlated features. So, a preliminary PCA would be warranted (a minimal sketch is shown after the next chunk)! Here we skip it to avoid making this too long, but know that in real research this would be important!
# Standardize all numeric features (mean=0, std=1)
from sklearn.preprocessing import StandardScaler
dfNumStd = StandardScaler().fit_transform(dfNum)
dfNumStd = pd.DataFrame(dfNumStd)
dfNumStd.columns = dfNum.columns
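As noted above, in real research a preliminary PCA could be used to obtain orthogonal features before K-Means. Here is a minimal sketch of what it could look like (not used in the rest of this exercise; pca and dfNumPCA are just illustrative names):

# Optional: project the standardized data onto orthogonal principal components
from sklearn.decomposition import PCA
pca = PCA() # keep all components; you could instead retain only the first few
dfNumPCA = pd.DataFrame(pca.fit_transform(dfNumStd))
print(pca.explained_variance_ratio_) # proportion of variance explained by each component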
Compute silhouette scores for various numbers of components k
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
Ks = range(2, 7) # from 2 to 6
silhouetteValues = [] # initialize list of results
for k in Ks:
    # create k-means object
    kmeans = KMeans(n_clusters=k)
    # predict classification using data
    classification = kmeans.fit_predict(dfNumStd)
    # compute and store silhouette score for k
    silhouetteValues.append(silhouette_score(dfNumStd, classification))
# Plot silhouette scores
plt.figure(figsize=(11,6))
sns.pointplot(x=list(Ks), y=silhouetteValues, scale=1.5)
plt.title("Silhouette scores vs k", fontsize=20)
plt.xlabel("k", fontsize=18)
plt.ylabel("Silhouette score", fontsize=18)
plt.xticks(fontsize=16);
plt.yticks(fontsize=16);
plt.show(); plt.close()
- According to this plot, what is the most appropriate number of clusters?
- Note that the silhouette score cannot be computed for k = 1, which is a relevant limitation when using k-means, because it leaves it unclear whether the “just one population” solution is actually best; alternatives could be computing the Duda-Hart test under some assumptions, or computing additional statistics, e.g., the within-cluster sum of squares (WCSS, the .inertia_ attribute), and inspecting the plot with the “elbow method” (see the sketch below).
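As a minimal sketch of that elbow inspection (using the same range of k values plus k = 1; wcss is just an illustrative name):

# Elbow method: within-cluster sum of squares (WCSS, .inertia_) for each k, including k = 1
wcss = []
for k in range(1, 7):
    wcss.append(KMeans(n_clusters=k).fit(dfNumStd).inertia_)

sns.pointplot(x=list(range(1, 7)), y=wcss)
plt.title("WCSS vs k (elbow method)")
plt.xlabel("k")
plt.ylabel("WCSS (inertia)")
plt.show(); plt.close()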
Plot cluster profiles when k = 2
# Fit K-Means with k = 2 and take classifications/labels
kmeans = KMeans(n_clusters=2)
classification = kmeans.fit_predict(dfNumStd)

# Compute mean profile of each cluster
clusterMeans = pd.DataFrame(dfNumStd).groupby(classification).mean()
clusterMeans.columns = dfNumStd.columns
# Plot average standardized scores for each cluster (".T" is for transposing)
sns.lineplot(data=clusterMeans.T, markers=True, markersize=10)
plt.title("Cluster profiles (KMeans, k=2)", fontsize=18)
plt.ylabel("z-score", fontsize=16)
plt.xlabel("", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
Gaussian Mixture Models
GMM is more flexible than K-Means. It does not assume equal variances across clusters and can model (because it is actually a model) correlations between features (i.e., full covariance structures). It can also be fitted with k = 1, so you can formally compare models with different numbers of components, starting from one, using criteria like BIC or AIC. Its main limitation is the assumption that data within each cluster follow a Gaussian distribution.
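For instance, the shape of the within-cluster covariances is controlled by the covariance_type argument of scikit-learn’s GaussianMixture ("full", "tied", "diag", "spherical"). A minimal sketch comparing these options by BIC, with the number of components arbitrarily fixed at 2 (gmmTry is just an illustrative name):

# Compare covariance structures by BIC, with the number of components fixed at 2
from sklearn.mixture import GaussianMixture
for covType in ["full", "tied", "diag", "spherical"]:
    gmmTry = GaussianMixture(n_components=2, covariance_type=covType).fit(dfNumStd)
    print(covType, round(gmmTry.bic(dfNumStd), 1))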
Compute BIC values for various numbers of components k
from sklearn.mixture import GaussianMixture
BICs = []
Ks = range(1, 7)
for k in Ks:
    # create and fit GMM object
    gmm = GaussianMixture(n_components=k).fit(dfNumStd)
    # extract and store BIC
    BICs.append(gmm.bic(dfNumStd))
# Plot BIC values
sns.pointplot(x=list(Ks), y=BICs, scale=1.5)
plt.title("BIC values for different GMMs", fontsize=18)
plt.xlabel("Number of Components", fontsize=16)
plt.ylabel("BIC", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
Plot cluster profiles when k = 2
gmm = GaussianMixture(n_components=2).fit(dfNumStd)
clusterMeans = pd.DataFrame(gmm.means_)
clusterMeans.columns = dfNumStd.columns

sns.lineplot(data=clusterMeans.T, markers=True, markersize=10)
plt.title("Cluster profiles (GMM, k=2)", fontsize=18)
plt.ylabel("z-score", fontsize=16)
plt.xlabel("", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
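One of the goals above was to compare K-Means and GMM results. A minimal sketch (assuming the K-Means classification with k = 2 from the earlier chunk is still in memory; gmmClassification is just an illustrative name) cross-tabulates the two sets of labels and computes their adjusted Rand index:

# Compare K-Means (k = 2) and GMM (k = 2) cluster assignments
from sklearn.metrics import adjusted_rand_score
gmmClassification = gmm.predict(dfNumStd)
print(pd.crosstab(classification, gmmClassification)) # how often the two labelings agree
print(adjusted_rand_score(classification, gmmClassification)) # 1 = identical partitions (label order ignored)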
Plot cluster profiles when k = 3
gmm = GaussianMixture(n_components=3).fit(dfNumStd)
clusterMeans = pd.DataFrame(gmm.means_)
clusterMeans.columns = dfNumStd.columns

sns.lineplot(data=clusterMeans.T, markers=True, markersize=10)
plt.title("Cluster profiles (GMM, k=3)", fontsize=18)
plt.ylabel("z-score", fontsize=16)
plt.xlabel("", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show(); plt.close()
Extra: Dimensionality reduction and 2-D scatter plots
The following chunk runs a GMM with k = 3 clusters, then runs a PCA to reduce the data to two dimensions for visualization, and finally shows the GMM-predicted clusters in a 2-D scatterplot.
# Fit GMM with 3 components
gmm = GaussianMixture(n_components=3).fit(dfNumStd)
clusters = gmm.predict(dfNumStd).astype(str)

# Reduce data to 2-D using PCA for plotting
from sklearn.decomposition import PCA
coordinates = pd.DataFrame(PCA(n_components=2).fit_transform(dfNumStd))
# Plot PCA projection as 2-D scatterplot
sns.scatterplot(x=coordinates[0], y=coordinates[1], hue=clusters, s=70, alpha=.8)
plt.title("PCA 2D projection of GMM clusters", fontsize=18)
plt.xlabel("PCA Component 1", fontsize=16)
plt.ylabel("PCA Component 2", fontsize=16)
plt.xticks(fontsize=13);
plt.yticks(fontsize=13);
plt.legend(fontsize=12)
plt.show(); plt.close()
What could you do, now?
- Repeat using your own dataset, or a dataset publicly available from OSF, Zenodo, Figshare, or elsewhere.
- Repeat k-means, but first perform PCA on the dataset to ensure you use orthogonal features.
- Generate data (e.g., with np.random.normal()) to simulate different clusters, and see whether the clustering methods return what you expect (a minimal sketch is given below).
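As a minimal sketch of that last point (all numbers below are arbitrary): simulate two groups with different means on four variables, stack them, and check whether the BIC favors two components:

# Simulate two groups of 100 "participants" on 4 variables, differing in their means
np.random.seed(0)
groupA = np.random.normal(loc=0.0, scale=1.0, size=(100, 4))
groupB = np.random.normal(loc=2.0, scale=1.0, size=(100, 4))
simData = np.vstack([groupA, groupB])

# BIC should be lowest around k = 2 if the separation is large enough
for k in range(1, 5):
    print(k, round(GaussianMixture(n_components=k).fit(simData).bic(simData), 1))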
Clustering techniques such as K-Means and Gaussian Mixture Models (GMM) will always assign data points to clusters, even when no true underlying structure exists, which can create an illusion of meaningful subgroups (a small demonstration is sketched at the end of this page).
Avoid interpreting clusters as real psychological types unless you have very strong theoretical and empirical justification.
Personally, I believe that in virtually no case can clustering meaningfully reveal any real underlying “category” when using behavioral data; see, e.g., here and here.
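To see the point about clustering always returning clusters, a minimal sketch (arbitrary numbers): run K-Means on pure, unstructured noise; it still partitions the data, and the silhouette score is typically modest but above zero:

# K-Means will happily partition pure noise into 3 "clusters"
np.random.seed(1)
noise = np.random.normal(size=(200, 4)) # no real group structure at all
noiseLabels = KMeans(n_clusters=3).fit_predict(noise)
print(pd.Series(noiseLabels).value_counts()) # every point gets assigned to some cluster
print(silhouette_score(noise, noiseLabels)) # typically modest, but above zero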