Exercises – Basics of Text Classification using Embeddings and Cosine Similarity

Cosine similarity

Cosine similarity is conceptually similar to correlation: it provides an index of how aligned two vectors are, taking values in \([-1, +1]\). We can compute this similarity across sentence embeddings: the more similar a sentence embedding is to the embedding of a reference concept, the more likely it is that the sentence expresses that concept.
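
For two vectors \(a\) and \(b\), the cosine similarity is the dot product divided by the product of the norms, \(\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}\). As a quick illustration (a minimal sketch with made-up toy vectors, not part of the exercise data), it can be computed directly with NumPy:

import numpy as np

# two toy vectors (hypothetical values, for illustration only)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

# cosine similarity = dot product / (product of the vector norms)
cosSim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cosSim, 3))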

Scenario

We want to classify emotional sentences based on their similarity to a set of reference emotion words in {"joy", "sadness", "anger", "fear", "surprise", "disgust"}. To do so, we compute the cosine similarity between the embedding of each sentence and the embeddings of the reference emotion words. The idea is that the reference emotion word whose embedding (and thus meaning) is most similar/aligned with a sentence's embedding will show the highest cosine similarity to it.

Preliminary Steps

# install the sentence-transformers package, if not already installed
# !pip install sentence-transformers

# import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

Import Dataset, Load Pretrained Model, Generate Sentence Embeddings

You can download the dataset from: emotionalTexts1.csv. The following steps are the same as in the previous exercise:

Import dataset

df = pd.read_csv("data/emotionalTexts1.csv")
df.head()
                                                text
0  As I opened the door, the familiar scent of ci...
1  As the door slammed shut behind me, my fists c...
2  Walking into the kitchen, I expected silence, ...
3  As I opened the gift, my eyes widened and a gr...
4  As I turned the corner, I found my old college...

Load pretrained model (once again, let’s use "all-MiniLM-L6-v2" from Sentence Transformers)

model = SentenceTransformer("all-MiniLM-L6-v2")

Generate Sentence Embeddings

sentenceEmbeddings = model.encode(df["text"])

sentenceEmbeddings = pd.DataFrame(sentenceEmbeddings)
sentenceEmbeddings.round(3)
       0      1      2      3      4    ...    379    380    381    382    383
0   -0.017  0.026  0.038  0.105  0.107  ... -0.098  0.136  0.031 -0.064  0.008
1    0.039  0.035  0.004  0.063  0.106  ...  0.015  0.100  0.007 -0.112  0.015
2    0.029  0.128  0.006  0.061  0.102  ...  0.030  0.076  0.047 -0.127 -0.047
3   -0.061  0.103  0.086  0.098  0.112  ... -0.010  0.013  0.033 -0.129  0.013
4   -0.077  0.029  0.063  0.035  0.025  ... -0.037  0.038 -0.015 -0.134  0.043
..     ...    ...    ...    ...    ...  ...    ...    ...    ...    ...    ...
351  0.032  0.028  0.013  0.066  0.060  ... -0.012  0.101  0.062 -0.091 -0.051
352  0.040 -0.038  0.057  0.072  0.062  ...  0.016  0.097 -0.029 -0.105 -0.014
353  0.050  0.012  0.023  0.069  0.113  ...  0.013  0.113 -0.002 -0.075 -0.009
354  0.035 -0.018  0.057  0.066  0.105  ... -0.051  0.080  0.117 -0.066  0.013
355  0.031  0.030  0.020  0.049  0.167  ... -0.007  0.076  0.059 -0.125 -0.040

[356 rows x 384 columns]

This gives us a matrix of shape (n_sentences, embedding_dim), where each row is a numerical representation of a sentence.
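
If you want to double-check the dimensions, the shape attribute is a quick (optional) sanity check:

# sanity check: number of sentences x embedding dimensionality
sentenceEmbeddings.shape   # (356, 384)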

Compute Embeddings for the “Target” Emotions

Our chosen target emotions / emotional categories are {"joy", "sadness", "anger", "fear", "surprise", "disgust"}. However, to make their natural-language meaning, and thus their embeddings, more interpretable and unambiguous, let’s incorporate the target emotions into short sentences like {"a person feeling joy", "a person feeling sadness", "a person feeling anger", "a person feeling fear", "a person feeling surprise", "a person feeling disgust"}. This is useful because text embedding models are specifically trained to capture meaning from sentences rather than isolated words:

emotions = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]
targetEmotions = ["a person feeling "+e for e in emotions]

emotionEmbeddings = model.encode(targetEmotions)

emotionEmbeddings = pd.DataFrame(emotionEmbeddings)
emotionEmbeddings.round(3)
     0      1      2      3      4    ...    379    380    381    382    383
0  0.017  0.026 -0.008 -0.039 -0.034  ...  0.037  0.011 -0.030 -0.014  0.040
1  0.047  0.031 -0.003  0.046 -0.004  ...  0.010  0.051  0.011 -0.031  0.031
2  0.014 -0.009 -0.020  0.025 -0.002  ...  0.014  0.084 -0.010 -0.081  0.022
3  0.056 -0.014 -0.011  0.052  0.045  ... -0.003 -0.004 -0.049 -0.035  0.073
4 -0.056  0.018  0.030  0.007  0.020  ...  0.011  0.050 -0.001 -0.095  0.002
5  0.043  0.006  0.044  0.020  0.014  ...  0.033  0.040 -0.002 -0.096 -0.027

[6 rows x 384 columns]

Compute Cosine Similarities

Now, finally, we can compute the cosine similarities! This allows us to compare each sentence (e.g., "Walking into the kitchen, I expected silence...") with each reference emotion “target” (e.g., "a person feeling joy").

similarityMatrix = cosine_similarity(sentenceEmbeddings, emotionEmbeddings)

similarityMatrix.round(3)
array([[0.205, 0.181, 0.183, 0.258, 0.327, 0.277],
       [0.237, 0.244, 0.264, 0.298, 0.292, 0.299],
       [0.285, 0.272, 0.127, 0.217, 0.45 , 0.284],
       ...,
       [0.076, 0.206, 0.171, 0.278, 0.234, 0.463],
       [0.275, 0.301, 0.217, 0.205, 0.243, 0.262],
       [0.275, 0.239, 0.191, 0.224, 0.379, 0.269]],
      shape=(356, 6), dtype=float32)
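
As an optional sanity check, a single entry of this matrix can be reproduced by hand from the dot-product formula shown earlier, e.g., the similarity between sentence #0 and the first emotion target:

# manually recompute one entry (sentence #0 vs "a person feeling joy")
sentVec = sentenceEmbeddings.iloc[0].to_numpy()
emoVec = emotionEmbeddings.iloc[0].to_numpy()
manualSim = np.dot(sentVec, emoVec) / (np.linalg.norm(sentVec) * np.linalg.norm(emoVec))
print(round(manualSim, 3))   # should match similarityMatrix[0, 0] (≈ 0.205 above)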

Visualize as heatmap

Perhaps a heatmap will be more effective for reporting and understanding 🙂

plt.figure(figsize=(8, 10))
plt.imshow(similarityMatrix[:25,:], aspect='auto', cmap='Blues')
plt.colorbar();
plt.title('Cosine Similarity Heatmap', fontsize=20)
plt.ylabel('Sentence #', fontsize=18)
plt.xticks(ticks=range(0, len(emotions)),
           labels=emotions, fontsize=18, rotation=45);
plt.yticks(ticks=range(0, 25), fontsize=14);
plt.show()

plt.clf(); plt.close()

The above heatmap suggests that, for example, sentence #2 is clearly associated with surprise (the sentence is: "Walking into the kitchen, I expected silence, but instead I was greeted by an unexpected choir of 'Happy Birthday!' with smiling faces peeking out from every corner. My eyes widened as I took in the sight of balloons and a cake that I had surely not planned for"), while sentence #15 is clearly associated with joy (the sentence is: "The sun warmed my skin as I spun around in the open field, arms outstretched, feeling the breeze dance through my hair. Laughter bubbled up uncontrollably, as if it were the most natural sound in the world").

Predict the Closest Emotion

Now let’s assign each sentence to the emotion with the highest similarity, using the .argmax() method:

closestEmotion = similarityMatrix.argmax(axis=1)
df["predictedEmotion"] = [emotions[i] for i in closestEmotion]
df.head()
                                                text predictedEmotion
0  As I opened the door, the familiar scent of ci...         surprise
1  As the door slammed shut behind me, my fists c...          disgust
2  Walking into the kitchen, I expected silence, ...         surprise
3  As I opened the gift, my eyes widened and a gr...         surprise
4  As I turned the corner, I found my old college...         surprise

Question: Do the assigned emotions seem plausible, just by inspecting the text and predictions? Inspect a few of them!

Evaluate Against Ground Truth

Actually, there was a ground truth! 🙂 Since I generated the sentences from specific prompts that mentioned those emotions, we can evaluate whether the classification based on the closest cosine similarity is sufficiently accurate. The labelled version of the dataset can be downloaded here: emotionalTexts1Labelled.csv

dfLabelled = pd.read_csv("data/emotionalTexts1Labelled.csv")
dfLabelled.head()
                                                text   emotion
0  As I opened the door, the familiar scent of ci...  surprise
1  As the door slammed shut behind me, my fists c...     anger
2  Walking into the kitchen, I expected silence, ...  surprise
3  As I opened the gift, my eyes widened and a gr...       joy
4  As I turned the corner, I found my old college...  surprise

df["labelledEmotion"] = dfLabelled["emotion"]

Let’s compare predicted vs labelled emotions in a confusion matrix (the crosstab() function from the pandas package is very convenient for computing a confusion matrix, in a way similar to the table() function in R):

confusionMatrix = pd.crosstab(df["labelledEmotion"],
                              df["predictedEmotion"])
confusionMatrix
predictedEmotion  anger  disgust  fear  joy  sadness  surprise
labelledEmotion                                               
anger                 8       17    28    0        0         6
disgust               0       56     0    0        0         0
fear                  0        7    33    0        0        20
joy                   0        3     0   42        0        17
sadness               0       11     3    1       36         5
surprise              0        3     0    1        0        59

plt.figure(figsize=(9, 6))
sns.heatmap(confusionMatrix, annot=True, cmap='Blues')
plt.title("Predicted-Labelled Confusion Matrix", fontsize=15)
plt.xlabel("Predicted Emotion", fontsize=15)
plt.ylabel("Labelled Emotion", fontsize=15)
plt.xticks(fontsize=13);
plt.yticks(fontsize=13, rotation=0);
plt.show()

plt.clf(); plt.close()
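
Besides the full confusion matrix, a single summary number can be handy. The overall accuracy (the proportion of sentences whose predicted emotion matches the labelled one) can be computed directly from the two columns created above; from the diagonal of the confusion matrix, this works out to 8 + 56 + 33 + 42 + 36 + 59 = 234 correct predictions out of 356 sentences, i.e. roughly 66%:

# overall accuracy: proportion of sentences where the prediction matches the label
accuracy = (df["predictedEmotion"] == df["labelledEmotion"]).mean()
print(round(accuracy, 3))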

What could you do next?

  • Try using different target sentences for the emotions (e.g., "I feel joy" instead of "a person feeling joy");
  • Try to group sentences based on other a priori defined target categories (e.g., non-emotional categories);
  • Try to expand the exercise to other datasets of texts/sentences that you may find around;
  • Replace the model with a larger one from Hugging Face and see if it works better (a minimal sketch is shown below).
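
For the last point, a minimal sketch could look like the following (assuming, for instance, the larger "all-mpnet-base-v2" model from the Sentence Transformers collection; any other sentence-embedding model name from Hugging Face would work the same way):

# swap in a larger sentence-embedding model and recompute the similarities
largerModel = SentenceTransformer("all-mpnet-base-v2")

sentenceEmbeddingsLarge = largerModel.encode(df["text"])
emotionEmbeddingsLarge = largerModel.encode(targetEmotions)

similarityMatrixLarge = cosine_similarity(sentenceEmbeddingsLarge, emotionEmbeddingsLarge)
closestEmotionLarge = similarityMatrixLarge.argmax(axis=1)
df["predictedEmotionLarge"] = [emotions[i] for i in closestEmotionLarge]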