# recommended if not already done
# !pip install sentence-transformers
# import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
Exercises – Basics of Text Classification using Embeddings and Cosine Similarity
Basics of Python for Data Science
Cosine similarity
Cosine similarity is conceptually similar to correlation, and it provides an index of how aligned two vectors are; it ranges in \([-1, +1]\). We can compute this similarity across sentence embeddings: the more similar a sentence embedding is to the embedding of a reference concept, the more likely it is that the sentence expresses that concept.
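To make the definition concrete, here is a minimal sketch of cosine similarity for two vectors, computed as their dot product divided by the product of their norms. The helper name `cosine_sim` and the toy vectors are assumptions made for illustration; they are not part of the exercise.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors chosen to hit the extremes and the midpoint of the range
same_direction = cosine_sim([1, 2, 3], [2, 4, 6])   # parallel vectors
orthogonal = cosine_sim([1, 0], [0, 1])             # perpendicular vectors
opposite = cosine_sim([1, 2, 3], [-1, -2, -3])      # opposite vectors

print(same_direction, orthogonal, opposite)
```

Parallel vectors give a value close to +1, perpendicular vectors give 0, and opposite vectors give a value close to -1. The length of the vectors does not matter, only their direction.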
Scenario
We want to classify emotional sentences based on their similarity to a set of reference emotion words in {"joy", "sadness", "anger", "fear", "surprise", "disgust"}. To do so, we compute the cosine similarity between the embedding of each sentence and the embeddings of the reference emotion words. The idea is that the reference emotion word whose embedding (thus, meaning) is most similar/aligned with a sentence embedding will present the highest cosine similarity to it.
Preliminary Steps
Import Dataset, Load Pretrained Model, Generate Sentence Embedding
You can download the dataset from: emotionalTexts1.csv. The following steps are the same as in the previous exercise:
Import dataset
df = pd.read_csv("data/emotionalTexts1.csv")
df.head()
text
0 As I opened the door, the familiar scent of ci...
1 As the door slammed shut behind me, my fists c...
2 Walking into the kitchen, I expected silence, ...
3 As I opened the gift, my eyes widened and a gr...
4 As I turned the corner, I found my old college...
Load pretrained model (once again, let’s use "all-MiniLM-L6-v2" from Sentence Transformers)
model = SentenceTransformer("all-MiniLM-L6-v2")
Generate Sentence Embeddings
# Encode every sentence, then wrap the array in a DataFrame for display
sentenceEmbeddings = model.encode(df["text"])
sentenceEmbeddings = pd.DataFrame(sentenceEmbeddings)
sentenceEmbeddings.round(3)
0 1 2 3 4 ... 379 380 381 382 383
0 -0.017 0.026 0.038 0.105 0.107 ... -0.098 0.136 0.031 -0.064 0.008
1 0.039 0.035 0.004 0.063 0.106 ... 0.015 0.100 0.007 -0.112 0.015
2 0.029 0.128 0.006 0.061 0.102 ... 0.030 0.076 0.047 -0.127 -0.047
3 -0.061 0.103 0.086 0.098 0.112 ... -0.010 0.013 0.033 -0.129 0.013
4 -0.077 0.029 0.063 0.035 0.025 ... -0.037 0.038 -0.015 -0.134 0.043
.. ... ... ... ... ... ... ... ... ... ... ...
351 0.032 0.028 0.013 0.066 0.060 ... -0.012 0.101 0.062 -0.091 -0.051
352 0.040 -0.038 0.057 0.072 0.062 ... 0.016 0.097 -0.029 -0.105 -0.014
353 0.050 0.012 0.023 0.069 0.113 ... 0.013 0.113 -0.002 -0.075 -0.009
354 0.035 -0.018 0.057 0.066 0.105 ... -0.051 0.080 0.117 -0.066 0.013
355 0.031 0.030 0.020 0.049 0.167 ... -0.007 0.076 0.059 -0.125 -0.040
[356 rows x 384 columns]
This gives us a matrix of shape (n_sentences, embedding_dim), where each row is a numerical representation of a sentence.
Also, Compute Embeddings for the “target” emotions
Our chosen target emotions / emotional categories are {"joy", "sadness", "anger", "fear", "surprise", "disgust"}. However, to make them more interpretable and unambiguous in terms of the natural language meaning, and thus of the embeddings, let’s incorporate the target emotions into short sentences like {"a person feeling joy", "a person feeling sadness", "a person feeling anger", "a person feeling fear", "a person feeling surprise", "a person feeling disgust"}. This is useful because text embedding models are specifically trained to capture meaning from sentences rather than isolated words:
emotions = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]
# Wrap each emotion word in a short sentence before encoding
targetEmotions = ["a person feeling "+e for e in emotions]
emotionEmbeddings = model.encode(targetEmotions)
emotionEmbeddings = pd.DataFrame(emotionEmbeddings)
emotionEmbeddings.round(3)
0 1 2 3 4 ... 379 380 381 382 383
0 0.017 0.026 -0.008 -0.039 -0.034 ... 0.037 0.011 -0.030 -0.014 0.040
1 0.047 0.031 -0.003 0.046 -0.004 ... 0.010 0.051 0.011 -0.031 0.031
2 0.014 -0.009 -0.020 0.025 -0.002 ... 0.014 0.084 -0.010 -0.081 0.022
3 0.056 -0.014 -0.011 0.052 0.045 ... -0.003 -0.004 -0.049 -0.035 0.073
4 -0.056 0.018 0.030 0.007 0.020 ... 0.011 0.050 -0.001 -0.095 0.002
5 0.043 0.006 0.044 0.020 0.014 ... 0.033 0.040 -0.002 -0.096 -0.027
[6 rows x 384 columns]
Compute Cosine Similarities
Now finally we can compute the cosine similarities! This allows us to compare each sentence (e.g., "Walking into the kitchen, I expected silence...") with the reference emotion “target” (e.g., "a person feeling joy").
# Rows are sentences, columns are target emotions
similarityMatrix = cosine_similarity(sentenceEmbeddings, emotionEmbeddings)
similarityMatrix.round(3)
array([[0.205, 0.181, 0.183, 0.258, 0.327, 0.277],
[0.237, 0.244, 0.264, 0.298, 0.292, 0.299],
[0.285, 0.272, 0.127, 0.217, 0.45 , 0.284],
...,
[0.076, 0.206, 0.171, 0.278, 0.234, 0.463],
[0.275, 0.301, 0.217, 0.205, 0.243, 0.262],
[0.275, 0.239, 0.191, 0.224, 0.379, 0.269]],
shape=(356, 6), dtype=float32)
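To see why the result has one row per sentence and one column per target emotion, here is a minimal sketch of a pairwise cosine similarity computed by hand: each row is scaled to unit length, and a single matrix product then gives all the cosines at once. The function name `pairwise_cosine` and the random toy arrays are assumptions for illustration. For rows that are not all zeros, this should agree with scikit-learn's `cosine_similarity` up to floating-point error.

```python
import numpy as np

def pairwise_cosine(A, B):
    """Cosine similarity between every row of A and every row of B (hypothetical helper)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Scale each row to unit length so that a dot product equals a cosine
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

# Toy stand-ins for the real embeddings: 5 "sentences" and 3 "targets" in 4 dimensions
rng = np.random.default_rng(0)
toy_sentences = rng.normal(size=(5, 4))
toy_targets = rng.normal(size=(3, 4))

toy_sim = pairwise_cosine(toy_sentences, toy_targets)
print(toy_sim.shape)  # (5, 3): one row per sentence, one column per target
```

With the real data, the same logic yields the (356, 6) matrix shown above: 356 sentences by 6 target emotions.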
Visualize as heatmap
Perhaps a heatmap will be more effective for reporting and understanding 🙂
# Plot the similarities of the first 25 sentences
plt.figure(figsize=(8, 10))
plt.imshow(similarityMatrix[:25,:], aspect='auto', cmap='Blues')
plt.colorbar();
plt.title('Cosine Similarity Heatmap', fontsize=20)
plt.ylabel('Sentence #', fontsize=18)
plt.xticks(ticks=range(0,len(emotions)), labels=emotions, fontsize=18, rotation=45);
plt.yticks(ticks=range(0,25),fontsize=14);
plt.show()
plt.clf(); plt.close()
The above heatmap suggests that, for example, sentence #2 is clearly associated with surprise (the sentence is: "Walking into the kitchen, I expected silence, but instead I was greeted by an unexpected choir of 'Happy Birthday!' with smiling faces peeking out from every corner. My eyes widened as I took in the sight of balloons and a cake that I had surely not planned for"), while sentence #15 is clearly associated with joy (the sentence is: "The sun warmed my skin as I spun around in the open field, arms outstretched, feeling the breeze dance through my hair. Laughter bubbled up uncontrollably, as if it were the most natural sound in the world").
Predict the Closest Emotion
Now let’s assign each sentence to the emotion with the highest similarity using the .argmax() method:
# Column index of the highest similarity in each row
closestEmotion = similarityMatrix.argmax(axis=1)
df["predictedEmotion"] = [emotions[i] for i in closestEmotion]
df.head()
text predictedEmotion
0 As I opened the door, the familiar scent of ci... surprise
1 As the door slammed shut behind me, my fists c... disgust
2 Walking into the kitchen, I expected silence, ... surprise
3 As I opened the gift, my eyes widened and a gr... surprise
4 As I turned the corner, I found my old college... surprise
Question: Do the assigned emotions seem plausible, just by inspecting the text and predictions? Inspect a few of them!
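The argmax step can be checked by hand on the first three rows of the similarity matrix printed earlier. The sketch below copies those rounded values. As an optional extra that is not part of the original exercise, it also computes the gap between the best and the second-best similarity, which gives a rough idea of how clear-cut each assignment is. The variable names `first_rows` and `margin` are assumptions for illustration.

```python
import numpy as np

emotions = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]

# First three rows of the similarity matrix, copied from the rounded output above
first_rows = np.array([
    [0.205, 0.181, 0.183, 0.258, 0.327, 0.277],
    [0.237, 0.244, 0.264, 0.298, 0.292, 0.299],
    [0.285, 0.272, 0.127, 0.217, 0.450, 0.284],
])

# Index of the largest value in each row, mapped back to the emotion name
best_column = first_rows.argmax(axis=1)
labels = [emotions[i] for i in best_column]

# Gap between the highest and second-highest similarity in each row
sorted_rows = np.sort(first_rows, axis=1)
margin = sorted_rows[:, -1] - sorted_rows[:, -2]

print(labels)           # ['surprise', 'disgust', 'surprise']
print(margin.round(3))
```

The labels match the first three predictions in the table above. In the second row, "disgust" (0.299) beats "fear" (0.298) by only 0.001, so that assignment is close to a tie and deserves a closer look when inspecting the predictions.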
Evaluate Against Ground Truth
Actually, there was a ground truth! 🙂 As I generated the sentences based on specific prompts that mentioned those emotions, we can evaluate whether the classification based on the closest cosine similarity is sufficiently accurate. The labelled version of the dataset can be downloaded here: emotionalTexts1Labelled.csv
dfLabelled = pd.read_csv("data/emotionalTexts1Labelled.csv")
dfLabelled.head()
text emotion
0 As I opened the door, the familiar scent of ci... surprise
1 As the door slammed shut behind me, my fists c... anger
2 Walking into the kitchen, I expected silence, ... surprise
3 As I opened the gift, my eyes widened and a gr... joy
4 As I turned the corner, I found my old college... surprise
df["labelledEmotion"] = dfLabelled["emotion"]
Let’s compare predicted vs labelled emotions in a confusion matrix (the crosstab() function from the pandas package is very convenient for computing a confusion matrix, in a way similar to the table() function in R):
# Rows are labelled emotions, columns are predicted emotions
confusionMatrix = pd.crosstab(df["labelledEmotion"],
                              df["predictedEmotion"])
confusionMatrix
predictedEmotion anger disgust fear joy sadness surprise
labelledEmotion
anger 8 17 28 0 0 6
disgust 0 56 0 0 0 0
fear 0 7 33 0 0 20
joy 0 3 0 42 0 17
sadness 0 11 3 1 36 5
surprise 0 3 0 1 0 59
# Show the confusion matrix as an annotated heatmap
plt.figure(figsize=(9, 6))
sns.heatmap(confusionMatrix, annot=True, cmap='Blues')
plt.title("Predicted-Labelled Confusion Matrix", fontsize=15)
plt.xlabel("Predicted Emotion", fontsize=15)
plt.ylabel("Labelled Emotion", fontsize=15)
plt.xticks(fontsize=13);
plt.yticks(fontsize=13, rotation=0);
plt.show()
plt.clf(); plt.close()
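The confusion matrix can also be summarised with a few numbers. The sketch below copies the counts from the crosstab output above and computes the overall accuracy (the share of sentences on the diagonal) and, for each labelled emotion, the share that was predicted correctly (per-class recall). The variable names are assumptions for illustration.

```python
import numpy as np

# Rows = labelled emotion, columns = predicted emotion (same order for both)
labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
counts = np.array([
    [8, 17, 28,  0,  0,  6],   # anger
    [0, 56,  0,  0,  0,  0],   # disgust
    [0,  7, 33,  0,  0, 20],   # fear
    [0,  3,  0, 42,  0, 17],   # joy
    [0, 11,  3,  1, 36,  5],   # sadness
    [0,  3,  0,  1,  0, 59],   # surprise
])

# Overall accuracy: correctly classified sentences / all sentences
accuracy = np.trace(counts) / counts.sum()

# Per-class recall: diagonal count / row total for each labelled emotion
recall = np.diag(counts) / counts.sum(axis=1)

print(f"accuracy = {accuracy:.3f}")   # 234 / 356, about 0.657
for name, r in zip(labels, recall):
    print(f"{name:>8}: {r:.3f}")
```

With these counts, 234 of the 356 sentences are classified correctly. Recall is perfect for disgust (56 of 56) but very low for anger (8 of 59), which is mostly predicted as fear or disgust. On the actual data frame, the same accuracy can be obtained with `(df["predictedEmotion"] == df["labelledEmotion"]).mean()`.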
What could you do next?
- Try using different target sentences for the emotions (e.g., "I feel joy" instead of "a person feeling joy");
- Try to group sentences based on other a priori defined target categories (e.g., non-emotional categories);
- Try to expand the exercise to other datasets of texts/sentences that you may find around;
- Replace the model with a larger one from HuggingFace and see if it works better.