# recommended if not already done
# !pip install transformers
# !pip install torch
# !pip install hf_xet
# import needed packages and functions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
import transformers
from torch.nn.functional import softmax
Exercises - Basics of Sentiment Analysis with Hugging Face Transformers
Basics of Python for Data Science
Scenario
We will work with a small dataset of Amazon customer reviews of a hair/beard trimmer.
To analyze the text, we will use a pretrained sentiment analysis model from HuggingFace. The model is based on RoBERTa (Robustly Optimized BERT [Bidirectional Encoder Representations from Transformers] Pretraining Approach). Specifically, we will use a model called "cardiffnlp/twitter-roberta-base-sentiment", which was trained and fine-tuned on Twitter text. I like it because it is quite lightweight and it returns 3 separate sentiment classes: Negative, Neutral, Positive.
(The chosen hair/beard trimmer product is actually reviewed pretty well overall, but I have deliberately sampled similar numbers of reviews for each level of customer ratings/stars ⭐ to ensure more variability… so the dataset is not representative of the actual product ratings.)
We will:
- Run a pretrained model to perform sentiment analysis on the review texts;
- Assess prediction accuracy (i.e., how well the model prediction actually matches the customer stars ⭐);
- Plot the relationship between model predictions and customer ratings.
Preliminary Steps
Import and Inspect Dataset
You can download the dataset from: panasonicTrimmer.xlsx
It contains customer reviews of a Panasonic hair/beard trimmer sold on Amazon, each associated with a short "Summary" description, a more extensive "Text" review, and the number of "Stars" rated by the customer (1–5), also manually annotated in a column named "trueSentiment" (1-2 stars: NEGATIVE; 3 stars: NEUTRAL; 4-5 stars: POSITIVE). Let’s inspect it first:
df = pd.read_excel("data/panasonicTrimmer.xlsx")
df.shape
(60, 4)
df.columns
Index(['Summary', 'Text', 'Stars', 'trueSentiment'], dtype='object')
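As a quick check of the deliberately balanced sampling mentioned above, one could also look at how the reviews are distributed across labels and stars (a small optional check; output omitted):
df["trueSentiment"].value_counts()        # reviews per annotated sentiment class
df["Stars"].value_counts().sort_index()   # reviews per star rating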
Load the Pretrained Sentiment Model
All we need to do is to create a sentiment analysis pipeline, specifying the model name and the task that it is required to do. The model is hosted on HuggingFace (we must specify the exact "Model Name"), and it is automatically accessed by transformers.
classifier = transformers.pipeline(task="sentiment-analysis",
                                   model="cardiffnlp/twitter-roberta-base-sentiment")
Run the Sentiment Analysis Task for each Review
Just for fun and practice, let’s try running the classifier pipeline on a single line of text:
"Wow, this product is great! :-D") classifier(
[{'label': 'LABEL_2', 'score': 0.9924255609512329}]
Ok, if you try a bit you will realize that for any piece of text it returns two elements:
- "label", which is the predicted sentiment class; it is not directly interpretable, but maps like: "LABEL_0" → "NEGATIVE"; "LABEL_1" → "NEUTRAL"; "LABEL_2" → "POSITIVE" (we need to keep this in mind and convert those later for interpretability);
- "score", a confidence value between 0 and 1 indicating the model’s certainty.
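For instance, one simple way to make a single prediction readable is an explicit dictionary mapping (just an illustration; labelMap and res are hypothetical names, and below we will use np.select() on the whole column instead):
labelMap = {"LABEL_0": "NEGATIVE", "LABEL_1": "NEUTRAL", "LABEL_2": "POSITIVE"}   # readable labels
res = classifier("Wow, this product is great! :-D")[0]
print(labelMap[res["label"]], round(res["score"], 3))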
Now, let’s run the classifier pipeline on every single cell of the "Text" column of our dataframe, and store the results in a list (using a list comprehension):
pred = [classifier(t)[0] for t in df["Text"]]
Let’s convert the result into a DataFrame for easier manipulation:
pred = pd.DataFrame(pred)
Now, for ease of “human” readability, let’s convert the labels using the np.select() function:
= [pred["label"]=="LABEL_0", pred["label"]=="LABEL_1", pred["label"]=="LABEL_2"]
conditions = ["NEGATIVE", "NEUTRAL", "POSITIVE"]
labels "predictedSentiment"] = np.select(conditions, labels, default="") df[
Evaluate Predictions
To assess how well the model is performing, we can compare the predictions (predictedSentiment) to the ground-truth labels in our dataset (trueSentiment). To do so, we can compute a confusion matrix: it will tell us how many "POSITIVE" reviews as predicted by the model actually correspond to ratings of 4-5 stars by the customer, and so on.
We can use the confusion_matrix() function from the sklearn.metrics submodule, which is analogous to the table() function in R:
CM = confusion_matrix(df["trueSentiment"], df["predictedSentiment"], labels=labels)
print(CM)
[[17 1 2]
[10 7 3]
[ 1 1 18]]
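From the same counts we can also get the overall accuracy (the diagonal of the matrix divided by the total), or obtain it directly from sklearn; a quick optional check (output omitted):
from sklearn.metrics import accuracy_score
print(CM.trace() / CM.sum())                                           # accuracy from the counts
print(accuracy_score(df["trueSentiment"], df["predictedSentiment"]))   # same, via sklearn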
For readability, we can transform the counts into proportions (0 to 1, row-wise):
print(CM / CM.sum(axis=1, keepdims=True))   # divide each row by its total
[[0.85 0.05 0.1 ]
[0.5 0.35 0.15]
[0.05 0.05 0.9 ]]
Question: which “sentiment” is more easily confused? why?
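Before answering, it may help to inspect the disagreements directly; a small exploratory sketch (the mismatches name is just illustrative):
mismatches = df[df["predictedSentiment"] != df["trueSentiment"]]   # rows where model and human disagree
mismatches[["Summary", "Stars", "trueSentiment", "predictedSentiment"]]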
From Classification to Continuous Sentiment Valence
I actually don’t like categorical classification very much. I would prefer to obtain a continuous valence score, where one extreme is absolutely negative and the other is absolutely positive. It is true that our model was trained to assign discrete labels; however, it generates underlying logit scores that reflect its confidence for each sentiment class. We can use those as proxies to compute a single continuous valence score.
- First, let’s tokenize the sentence. Now, this is a bit technical…
tokenizer = transformers.AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
= tokenizer("My day was so-so, but afterall not so bad", return_tensors="pt")
tokenSentence print(tokenSentence)
{'input_ids': tensor([[ 0, 2387, 183, 21, 98, 12, 2527, 6, 53, 71, 1250, 45,
98, 1099, 2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
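Just to see how the sentence was split, the ids can be mapped back to sub-word tokens (an optional peek, not needed for the rest of the pipeline):
print(tokenizer.convert_ids_to_tokens(tokenSentence["input_ids"][0].tolist()))   # ids → sub-word tokens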
- Second, let’s run the model on the tensor of tokens and extract the logits (the first score is for negative, the second for neutral, the third for positive):
model = transformers.AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

logitScores = model(input_ids=tokenSentence["input_ids"],
                    attention_mask=tokenSentence["attention_mask"]).logits
print(logitScores)
tensor([[ 0.4030, 0.2260, -0.6003]], grad_fn=<AddmmBackward0>)
- Third, let’s convert logits to probabilities. A logit score can be converted into a probability; a set of logit scores can be normalized into probabilities that cumulatively add up to one. We can use the softmax() function (from the torch.nn.functional submodule; dim=1 tells the function that it must be computed row-wise):
note: softmax does like: \(\frac{e^{0.4030}}{e^{0.4030} + e^{0.2260} + e^{-0.6003}} = 0.4536\)
probs = softmax(logitScores, dim=1)
print(probs)
tensor([[0.4536, 0.3800, 0.1663]], grad_fn=<SoftmaxBackward0>)
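As a sanity check, the same numbers can be reproduced manually from the logits with numpy (purely illustrative, not needed for the pipeline; manual is a hypothetical name):
manual = np.exp(logitScores.detach().numpy())   # exponentiate each logit
print(manual / manual.sum())                    # normalize: should match the probs tensor above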
- Finally, to summarize valence as one single score, we can weight the negative probability by -1, the neutral by 0, and the positive by +1, and sum them up, so the final score will lie in a continuous range of [-1, +1]:
probs[0][0]*(-1) + probs[0][1]*0 + probs[0][2]*(+1)
tensor(-0.2873, grad_fn=<AddBackward0>)
Good! Now let’s compute all of this for every customer review:
"valence"] = np.nan
df[for i in range(len(df["Text"])):
= tokenizer(df["Text"][i], return_tensors="pt") # encode sentence
sentence = model(**sentence).logits # forward pass
logitScores = softmax(logitScores, dim=1) # convert to probabilities
probs "valence"] = float(probs[0][0]*(-1) + # weighted sum: neg = -1
df.loc[i, 0][1]*0 + # neutral = 0
probs[0][2]*(+1)) # pos = +1
probs[
df.head()
Summary ... valence
0 Good trimmer, waterproof ... 0.909970
1 Quality hair trimmer ... 0.651900
2 One word - Brilliant! ... 0.980573
3 Excellent, at half the price! ... 0.896451
4 Ideal Purchase ... 0.958984
[5 rows x 6 columns]
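As a side note, no training is involved here, so the same loop could be wrapped in torch.no_grad() to avoid tracking gradients; a slightly leaner variant of the loop above (same logic, same result):
import torch

df["valence"] = np.nan
with torch.no_grad():                                              # inference only, no gradients
    for i in range(len(df["Text"])):
        sentence = tokenizer(df["Text"][i], return_tensors="pt")
        probs = softmax(model(**sentence).logits, dim=1)
        df.loc[i, "valence"] = float(probs[0][2] - probs[0][0])    # P(pos) - P(neg) = same weighted sum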
Scatter Plot of predicted valence vs customers’ rating stars
plt.figure(figsize=(12,6))
sns.scatterplot(x=df["Stars"], y=df["valence"], alpha=0.4, s=120, color="darkgreen")
plt.title("Model Valence vs User Rating", fontsize=20)
plt.xlabel("User Rating (Stars)", fontsize=20)
plt.ylabel("Sentiment Valence (from Model)", fontsize=20)
plt.xticks(fontsize=16);
plt.yticks(fontsize=16);
plt.show()
plt.clf(); plt.close()
Is there a clear relationship between model valence and actual star ratings?
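One quick way to quantify the association is a rank (Spearman) correlation between stars and valence (an optional check; output omitted):
df["Stars"].corr(df["valence"], method="spearman")   # rank correlation between ratings and valence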
Fit a Logistic Regression
Finally, let’s plot a regression line. But we must be careful: since valence is bounded, the relation should not be modeled as linear; rather, it could follow a logistic curve. For convenience, let’s first rescale valence from [-1, 1] to [0, 1], and then run a seaborn regplot with a logistic regression:
# rescale valence from [−1, +1] → [0, 1] for logistic model
df["valenceRescaled"] = (df["valence"] + 1) / 2

# fit and plot logistic regression: stars → valence
plt.figure(figsize=(12,6))
sns.regplot(x=df["Stars"], y=df["valenceRescaled"], logistic=True,
            scatter_kws={"alpha": 0.4, "s": 120}, color="darkgreen")
plt.title("Rescaled Valence ~ Logistic Regression on Stars", fontsize=20)
plt.xlabel("User Rating (Stars)", fontsize=20)
plt.ylabel("Sentiment Valence (from Model)", fontsize=20)
plt.xticks(fontsize=16);
plt.yticks(fontsize=16);
plt.show()
plt.clf(); plt.close()
What could you do now?
- Try to rerun all of the above, but first combine the "Summary" column with the more extensive "Text" review column (this should improve classification ability even further, because the summary adds very dense information); a possible starting point is sketched right after this list;
- Try other sentiment models (e.g., search "sentiment" on HuggingFace);
- Find mismatches: try the above pipeline with different types of text and find out where the model possibly disagrees strongly with human labels;
- Translate the reviews and test if the model still works on non-English texts, then find the best models for non-English text.
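For the first idea, a minimal sketch (assuming both columns contain plain strings; the fullText, predFull, and predictedSentimentFull names are just illustrative):
df["fullText"] = df["Summary"] + ". " + df["Text"]             # combine summary and full review
predFull = pd.DataFrame([classifier(t)[0] for t in df["fullText"]])
df["predictedSentimentFull"] = np.select(
    [predFull["label"]=="LABEL_0", predFull["label"]=="LABEL_1", predFull["label"]=="LABEL_2"],
    ["NEGATIVE", "NEUTRAL", "POSITIVE"], default="")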
As said above, this model was trained on Twitter texts. Domain shift (e.g., Amazon reviews) can make its performance suboptimal.