
Exercises - Basics of Text Embeddings (plus PCA and Clustering)

Basics of Python for Data Science

Embeddings

Text embeddings are numerical vectors that encode the semantic meaning of words, sentences, and documents. They are a core component of AI language models: under the hood, these models process language, perform reasoning, and capture relationships and analogies by running numerical computations on embeddings. It’s all about embeddings.
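
To make this concrete, here is a minimal sketch (the sentences are made-up examples) of how semantic similarity between two sentences reduces to a simple numerical operation, the cosine similarity, between their embedding vectors; it uses the same model we will load later in this exercise:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# two sentences with similar meaning and one unrelated sentence (illustrative examples)
sentences = ["I was overjoyed by the news.",
             "The news filled me with happiness.",
             "The invoice is due next Tuesday."]
emb = model.encode(sentences)

# cosine similarity of the embedding vectors: higher = more similar in meaning
print(util.cos_sim(emb[0], emb[1]))  # expected to be relatively high
print(util.cos_sim(emb[0], emb[2]))  # expected to be relatively low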

Scenario

In this exercise, we will use embeddings to analyze a series of short emotional texts (they were actually generated by GPT using this Python code, if you are curious about that).

We want to:

  • Compute sentence embeddings using a pretrained model;
  • Use PCA (Principal Component Analysis) to reduce the dimensionality of the embeddings, so that they become manageable for our analysis (clustering) given our sample size;
  • Use clustering to group sentences with a similar semantic (emotional) core;
  • Plot the results in a 2-D space, well, because it looks cool in a report 🙂

To generate the embeddings, we will use a model called "sentence-transformers/all-MiniLM-L6-v2", from the Sentence Transformers library, available on our beloved HuggingFace platform.

Preliminary Steps

# recommended if not already done
# !pip install sentence-transformers

# import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer

Import and Inspect the Dataset

You can download the dataset from: emotionalTexts1.csv

df = pd.read_csv("data/emotionalTexts1.csv")
df.head()
                                                text
0  As I opened the door, the familiar scent of ci...
1  As the door slammed shut behind me, my fists c...
2  Walking into the kitchen, I expected silence, ...
3  As I opened the gift, my eyes widened and a gr...
4  As I turned the corner, I found my old college...

The only column that we actually need is "text", which contains the short, emotionally charged sentences.
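
If you want to quickly check the size of the dataset and make sure no texts are missing, something like this will do:

# number of rows/columns and count of missing entries in "text"
print(df.shape)
print(df["text"].isna().sum())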

Generate Sentence Embeddings

Let’s now use a pretrained model to generate embeddings for these texts

model = SentenceTransformer("all-MiniLM-L6-v2")

# encode each text into a 384-dimensional embedding vector
embeddings = model.encode(df["text"].tolist(), show_progress_bar=False)
embeddings = pd.DataFrame(embeddings)
embeddings.round(3)
       0      1      2      3      4    ...    379    380    381    382    383
0   -0.017  0.026  0.038  0.105  0.107  ... -0.098  0.136  0.031 -0.064  0.008
1    0.039  0.035  0.004  0.063  0.106  ...  0.015  0.100  0.007 -0.112  0.015
2    0.029  0.128  0.006  0.061  0.102  ...  0.030  0.076  0.047 -0.127 -0.047
3   -0.061  0.103  0.086  0.098  0.112  ... -0.010  0.013  0.033 -0.129  0.013
4   -0.077  0.029  0.063  0.035  0.025  ... -0.037  0.038 -0.015 -0.134  0.043
..     ...    ...    ...    ...    ...  ...    ...    ...    ...    ...    ...
351  0.032  0.028  0.013  0.066  0.060  ... -0.012  0.101  0.062 -0.091 -0.051
352  0.040 -0.038  0.057  0.072  0.062  ...  0.016  0.097 -0.029 -0.105 -0.014
353  0.050  0.012  0.023  0.069  0.113  ...  0.013  0.113 -0.002 -0.075 -0.009
354  0.035 -0.018  0.057  0.066  0.105  ... -0.051  0.080  0.117 -0.066  0.013
355  0.031  0.030  0.020  0.049  0.167  ... -0.007  0.076  0.059 -0.125 -0.040

[356 rows x 384 columns]

Good! So now we have this array that has \(356\) rows (our sentences) and \(384\) columns (their embeddings). Wow! We have more features (variables) than observations! 😬

Now, we want to perform clustering on the text embeddings but we have two problems:

  • Too many variables (actually, more features than observations!);
  • Possibly highly correlated/redundant embedding dimensions (which is especially problematic for the k-means clustering algorithm).

What can we do?

Principal Component Analysis

PCA addresses both of the above problems, as it 1) reduces dimensionality and 2) returns orthogonal (uncorrelated) features.

Let’s run PCA() with all possible components (which is the minimum between the number of rows and the number of columns, here \(356\)), and see how much variance they explain.

pca = PCA()
pca_emb = pca.fit_transform(embeddings)
pca_expl_var = pca.explained_variance_ratio_

plt.figure(figsize=(10, 6))
plt.plot(range(1,len(pca_expl_var)+1), pca_expl_var, marker="o")
plt.title("Explained Variance by PCA Components", fontsize=18)
plt.xlabel("Number of Principal Components", fontsize=16)
plt.ylabel("Explained Variance", fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.grid(True)
plt.show()
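
As a quick sanity check of point 2) above, you can verify that the principal-component scores are uncorrelated (up to numerical precision); here, only the first 50 components are checked to keep the output small:

# correlation matrix of the first 50 PCA scores: off-diagonal entries should be ~0
corr = np.corrcoef(pca_emb[:, :50], rowvar=False)
print(np.abs(corr - np.eye(corr.shape[0])).max())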

Question: based on the elbow method applied to the above plot of the explained variance by component, how many components should we retain?

Actually Reduce Embeddings for Clustering

pca20 = PCA(n_components=20)
pca20_emb = pca20.fit_transform(embeddings)
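
Before clustering, it can also be useful to check how much of the total variance these 20 components retain (the exact value depends on the data, so just inspect the printed number):

# fraction of total variance captured by the 20 retained components
print(pca20.explained_variance_ratio_.sum())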

Run k-means

Let’s run the k-means algorithm with different numbers of clusters (from 2 to 10) and look at the silhouette score, a common statistic used for selecting the optimal number of clusters (higher is better).
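
For reference, the silhouette value of a single observation \(i\) is \(s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}\), where \(a(i)\) is the mean distance from \(i\) to the other points in its own cluster and \(b(i)\) is the mean distance from \(i\) to the points of the nearest other cluster; the silhouette score computed below is simply the average of \(s(i)\) over all observations, so it ranges from \(-1\) to \(1\).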

silhouetteValues = []
Ks = range(2, 11)

for k in Ks:
    kmeans = KMeans(n_clusters=k, random_state=0)
    classification = kmeans.fit_predict(pca20_emb)
    silScore = silhouette_score(X=pca20_emb, labels=classification)
    silhouetteValues.append(silScore)

# let's plot 
plt.figure(figsize=(10, 6))
sns.pointplot(x=Ks, y=silhouetteValues)
plt.title("Silhouette scores vs Number of clusters (k)",fontsize=18)
plt.xlabel("Number of clusters (k)",fontsize=16)
plt.ylabel("Silhouette Score",fontsize=16)
plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
plt.show()

plt.clf(); plt.close()
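
If you prefer to read the best value of \(k\) off programmatically rather than from the plot, a quick way (given the lists defined above) is:

# pick the k with the highest silhouette score
best_k = Ks[int(np.argmax(silhouetteValues))]
print(best_k)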

The best silhouette value is obtained for k = 5, but the values for other numbers of clusters are very close. For now, let’s go on with k = 5:

kmeans = KMeans(n_clusters=5, random_state=0)
classification = kmeans.fit_predict(pca20_emb)

df["cluster"] = classification

2-D Visualization of Text Clusters

To help interpret the results, but also to get a nice, colorful plot 😉, let’s further reduce the embeddings to 2-D and plot all the observations in this 2-D space, coloring them according to their "cluster".

pca2D = PCA(n_components=2)
coordinates = pd.DataFrame( pca2D.fit_transform(embeddings) )

plt.figure(figsize=(10, 6))
sns.scatterplot(x=coordinates[0], y=coordinates[1], hue=df["cluster"],
                palette="tab10", s=100, alpha=0.8)

plt.title("Emotional Texts Clusters in 2D PCA Space", fontsize=20)
plt.xlabel("Principal Component 1", fontsize=20)
plt.ylabel("Principal Component 2", fontsize=20)
plt.xticks(fontsize=16);
plt.yticks(fontsize=16);
plt.legend(title="Cluster", fontsize=14, title_fontsize=16)
plt.show()

What could you do now?

  • Interpretation! Inspect a few sentences for each cluster (see the sketch after this list): how are they similar?
  • Try to extract a different number of clusters (e.g., only 2);
  • Try to replace the model with others available on HuggingFace in the sentence-transformers library;
  • Translate the texts into other languages and see if the results are robust;
  • Try to apply all of the above to other datasets of texts of your own.
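
As a starting point for the first suggestion, here is a minimal sketch that prints a few randomly sampled sentences from each cluster:

# print up to three example sentences per cluster
for c in sorted(df["cluster"].unique()):
    examples = df.loc[df["cluster"] == c, "text"]
    print(f"\n--- Cluster {c} ---")
    for text in examples.sample(min(3, len(examples)), random_state=0):
        print("-", text)
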
Be careful when interpreting clusters

Clusters were derived in an entirely unsupervised way from embeddings and dimensionality reduction: the model is not trying to match them explicitly with any human “emotion categories”. An alternative, if you have “labels”, could be to match each sentence with the most closely corresponding “label”…
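
For example, a minimal sketch of that alternative, assuming you have a list of candidate emotion labels (the labels below are purely hypothetical examples), could embed the labels with the same model and assign each sentence to the label whose embedding is most similar:

from sentence_transformers import util

# hypothetical candidate labels -- replace them with ones that make sense for your data
labels = ["joy", "anger", "sadness", "fear", "surprise"]
label_emb = model.encode(labels)

# cosine similarity between each sentence embedding and each label embedding
sims = util.cos_sim(embeddings.to_numpy(), label_emb)   # shape: (n_texts, n_labels)

# assign each sentence to its most similar label
df["closest_label"] = [labels[i] for i in sims.argmax(dim=1).tolist()]
df[["text", "closest_label"]].head()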