Exercises – Other Examples of Language, Speech, and Image Processing
Basics of Python for Data Science
First of all, let’s import the pipeline() function from the transformers library. transformers provides access to several deep learning models for NLP and speech processing hosted on HuggingFace. The pipeline() function provides a very simple interface to access these models and apply them to common tasks (e.g., text generation, audio transcription, missing-word prediction).
from transformers import pipeline
Text Generation
Transformer models like GPT-2 are pretrained to generate natural language based on linguistic prompts (we all know ChatGPT!). Unlike newer GPT versions, GPT-2 has been made freely available via HuggingFace, and it can be used even without an OpenAI API key and billing plan.
All we have to do is initialize and use an instance of pipeline(), specifying the desired model ("gpt2") and the required task ("text-generation"):
textgenerator = pipeline(task="text-generation", model="gpt2")
text = textgenerator("Once upon a time in academia", max_length=150, temperature=1)
print(text)
[{'generated_text': 'Once upon a time in academia or, for that matter, in the United States, a national park, you are exposed to some kind of strange life that is not allowed under the law, and one of the things that is kind of surprising is that even if there is, you can still have people come, it is not a part of the general law.\n\nIn a state like Colorado, it is really shocking to have a state that is not subject to the law. The same way I am shocked that we have a lot of people who are actually out there with the law, that they are not protected by it, but are in some way doing something really horrible, they are going to go to the jail, or if that happens,'}]
The model continues the text following the initial prompt that you provided with a series of plausible (even though not necessarily coherent) sentences. In addition to the prompt, we have passed two arguments:
- max_length, which defines the total number of tokens (including the input prompt) in the generated sequence;
- temperature, which controls the randomness of predictions: lower values make the output more deterministic and even repetitive, while higher values lead to more randomness and creativity, though extreme values may produce totally incoherent outputs (temperature must be greater than 0; values between 0 and 2 are typical).
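As a quick illustration of the temperature effect, here is a minimal sketch (the prompt and the three temperature values are arbitrary choices; do_sample=True ensures that sampling, and hence temperature, is actually used):
# Compare outputs at low, medium, and high temperature (illustrative values)
for temp in [0.2, 1.0, 1.8]:
    out = textgenerator("Once upon a time in academia",
                        max_length=50,
                        temperature=temp,
                        do_sample=True)  # sampling must be on for temperature to matter
    print(f"--- temperature = {temp} ---")
    print(out[0]["generated_text"])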
Audio Speech Transcription
Another very useful application is automatic speech recognition and transcription. In the following chunk we use a Whisper model to transcribe a short speech from an .mp3 file.
The specific model we use here ("openai/whisper-small"), which is quite lightweight, can be found on HuggingFace at this link: https://huggingface.co/openai/whisper-small. Larger and more powerful versions, like "openai/whisper-medium" and "openai/whisper-large", are also available.
whisperpipe = pipeline(model="openai/whisper-small")
transcription = whisperpipe("data/gelmanstats.mp3", return_timestamps=True)
print(transcription['text'])
Hi, my name is Andrew Gelman. I'm a professor of statistics and political science at Columbia. I've been teaching here since 1996. I'm always impressed that the students know more than I do about so many things. Just this morning, a graduate student came to me. He had been looking at data from ticketing, police moving violations. And there's this suspicion that the police have some sort of unofficial quota so that they're supposed to get a certain number of tickets by the end of the month. But you can look at the data and see whether you have more tickets at the very end of the month. He looked at something slightly different. He compared police that had had fewer tickets in the first half of the month to those who had more tickets and found that the ones who had fewer in the first half had a big jump in the second half of the month, which he attributed to a policy where the police don't have a quote of, but they do it comparatively. So if you've been ticketing fewer than other people in your precinct, then they hassle you and tell you you're supposed to do more. But we're still not quite sure. He was just telling me about this today. It's complicated because there could just be a variation. It could just be that the police who had more tickets in the first half have fewer in the second half just because it goes up and it goes down. and you happen to see that. So we have to do more, look at the data in different ways in order to try to understand, like to rule out different explanations for what's happening. I think that at least well it depends on what field you're working on. So if if someone's working, we have a student working on astronomy and there we're working with the astronomy professor David Shimonovich and it's It's very, it's necessary for us to combine our skills. So we have skills in visualizing data and fitting models and computation, but it's not like David Shimano, which gives us the data and we fit it. We have to work together in a collaboration. So that's sort of always the case, that we have to have a bridge connecting the applications to the modeling and the data. I think all the projects I've told you so far are exciting and interesting, and we're We're doing other things too. We have a paper called 19 Things We Learned From The 2016 Election. We're breaking down survey data, looking at how people voted, how young people and old people voted, comparing the gender gap among younger and older voters and among different ethnic groups. That's very challenging. Even if you have a big survey where you want to estimate small subgroups of the population, it requires some statistical modeling. Well, I think the polls were pretty good. They had Hillary Clinton ahead by getting about 52% of the two-party vote, and she actually got 51%, so that wasn't too bad. Polls are often some key states. I think that had to do with non-response, that people, certain Republicans who are not responding to polls in certain states. The way we can go forward is to do more adjustment of surveys. So it's harder and harder to reach people, so more and more adjustment needs to be done, but maybe surveys need to be adjusted also based on whether people are living in urban or rural areas and their partisanship and some other things. So one is never going to do perfect, but there's potential for improvement.
- ⚠️ Download the audio file gelmanstats.mp3, and then make sure that it can be correctly accessed from the working directory. return_timestamps=True is required for relatively long audios, but it can be set to False for shorter audios to get a simpler output.
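Because we set return_timestamps=True, the returned dictionary also contains a 'chunks' field with per-segment time spans. A minimal sketch of how to inspect it (assuming the transcription above succeeded):
# Print the first few segments with their start/end times
for chunk in transcription["chunks"][:3]:
    start, end = chunk["timestamp"]
    print(f"[{start:.1f}s - {end:.1f}s] {chunk['text']}")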
If you prefer not to download and run the whole whisper model locally, you could use the OpenAI API (this effectively transfers the task to the cloud). But note that to do so you need a valid billing plan with OpenAI and a valid API key (also, the following code may no longer work if OpenAI changes the pipeline interface, which happens quite often):
from openai import OpenAI

client = OpenAI(api_key="sk-proj-4slS9XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
audio_file = open("data/gelmanstats.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)
print(transcript)
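Note that the object returned by the OpenAI client is not a plain string; assuming the current openai Python package, the transcribed text itself can be accessed via its text attribute:
print(transcript.text)  # just the transcribed text
audio_file.close()      # close the file handle when done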
Predict Missing Word
Models like BERT (Bidirectional Encoder Representations from Transformers) are pretrained to predict masked tokens in a text (and whether one sentence follows another). An interesting feature is that they allow us to predict a missing “masked” word in a chunk of text, based on the surrounding context. This can be adapted for tasks like establishing the most expected word in a context, correcting typos, or clarifying the meaning of ambiguous sentences given the context.
The specific model we use here can be found on HuggingFace at this link: https://huggingface.co/google-bert/bert-base-uncased.
findmaskedword = pipeline(task="fill-mask", model="bert-base-uncased")
findmaskedword("My professor is [MASK].")
[{'score': 0.1994817852973938,
'token': 2757,
'token_str': 'dead',
'sequence': 'my professor is dead.'},
{'score': 0.16430267691612244,
'token': 2182,
'token_str': 'here',
'sequence': 'my professor is here.'},
{'score': 0.09633355587720871,
'token': 2157,
'token_str': 'right',
'sequence': 'my professor is right.'},
{'score': 0.07279013842344284,
'token': 2908,
'token_str': 'gone',
'sequence': 'my professor is gone.'},
{'score': 0.020747046917676926,
'token': 2045,
'token_str': 'there',
'sequence': 'my professor is there.'}]
We get the top 5 most likely words for the masked token. To demonstrate how the likelihood of the words changes with the context, let’s slightly modify the sentence:
"My professor is [MASK], so don't worry.") findmaskedword(
[{'score': 0.33167019486427307,
'token': 2182,
'token_str': 'here',
'sequence': "my professor is here, so don ' t worry."},
{'score': 0.08260247111320496,
'token': 2157,
'token_str': 'right',
'sequence': "my professor is right, so don ' t worry."},
{'score': 0.04793822020292282,
'token': 5697,
'token_str': 'busy',
'sequence': "my professor is busy, so don ' t worry."},
{'score': 0.04354258254170418,
'token': 2986,
'token_str': 'fine',
'sequence': "my professor is fine, so don ' t worry."},
{'score': 0.03748282417654991,
'token': 2045,
'token_str': 'there',
'sequence': "my professor is there, so don ' t worry."}]
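The fill-mask pipeline returns the top 5 candidates by default; its top_k argument lets you request a different number (a small sketch reusing the pipeline above):
# Request only the top 3 candidates for the masked word
findmaskedword("My professor is [MASK], so don't worry.", top_k=3)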
Image Classification
Let’s try a simple example of image classification using a pretrained computer vision model from HuggingFace. The chosen model ("microsoft/resnet-18") is not very powerful, but it is small and very lightweight (for more powerful but also heavier models, try "microsoft/resnet-152", "google/efficientnet-b4", or "google/efficientnet-b7"). The task is to predict the most likely category for the image below.
from PIL import Image

myImage = Image.open("data/figureexample.png")
imageclassifier = pipeline(task="image-classification", model="microsoft/resnet-18")
predictions = imageclassifier(myImage)
print(predictions)
[{'label': 'water snake', 'score': 0.2936452627182007}, {'label': 'Indian cobra, Naja naja', 'score': 0.28091076016426086}, {'label': 'ringneck snake, ring-necked snake, ring snake', 'score': 0.16295978426933289}, {'label': 'thunder snake, worm snake, Carphophis amoenus', 'score': 0.12134847044944763}, {'label': 'hognose snake, puff adder, sand viper', 'score': 0.05200718343257904}]
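The raw output is a list of label/score dictionaries; to read it more easily you could print one prediction per line (a minimal sketch based on the predictions object above):
# Print each candidate label with its score, best first
for pred in predictions:
    print(f"{pred['score']:.3f}  {pred['label']}")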
What could you do next?
- Play a bit with text generation and change the prompt (e.g., "In the far future,", "Brilliant psychologists discovered that").
- Vary arguments like max_length, num_return_sequences, or temperature and see what happens to text generation.
- Try to run image classification using pictures with different styles (e.g., photographs, drawings) and see how the model performs.
- If you have a few pictures of different categories (e.g., animate objects vs inanimate objects), try to run a loop, classify all of them, and then use another model for text classification to categorize the textual outputs into these two (or more) a priori categories (see the sketch after this list).
- Have a look at all the task options that are available using transformers pipelines at this link: https://huggingface.co/docs/transformers/main_classes/pipelines (e.g., try "text-classification", "zero-shot-classification", or "translation" using the "Helsinki-NLP/opus-mt-en-it" model for English-to-Italian translation, as sketched below).
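As a starting point for the loop idea above, here is a hypothetical sketch: the file names are made up, and "facebook/bart-large-mnli" is one common choice for zero-shot classification (an assumption, not a model prescribed by this exercise):
# Classify several images, then map each top label onto two
# a priori categories with a zero-shot text classifier
from PIL import Image

zeroshot = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")
for path in ["data/cat.png", "data/chair.png"]:  # hypothetical file names
    top_label = imageclassifier(Image.open(path))[0]["label"]
    result = zeroshot(top_label, candidate_labels=["animate", "inanimate"])
    print(path, "->", top_label, "->", result["labels"][0])
And a minimal translation sketch with the model mentioned in the last bullet (the input sentence is an arbitrary example):
translator = pipeline(task="translation", model="Helsinki-NLP/opus-mt-en-it")
print(translator("My professor is here, so don't worry.")[0]["translation_text"])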