Exercises – Additional: Using AI / LLMs as research assistants in systematic reviews
Basics of Python for Data Science
Information/Data Extraction
The following series of code chunks was actually used in a review to process several PDF documents and extract relevant information based on a textual prompt, using an OpenAI language model (GPT-4 family).
Import libraries
```python
import os
import openai
import pandas as pd
import numpy as np
from PyPDF2 import PdfReader
import tiktoken
```
- `os`: for file management
- `openai`: OpenAI package, interfaced via the API (cloud based)
- `pandas`/`numpy`: data manipulation, as we now know well
- `PyPDF2`: reading PDF contents
- `tiktoken`: estimating token length for OpenAI models
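As a quick illustration of the last point, `tiktoken` converts text into the same token units that OpenAI models (and their pricing) are based on. A minimal sketch:

```python
import tiktoken

# Get the tokenizer used by GPT-4 models and count the tokens in a sample string
enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode("How many tokens is this sentence?")))  # a small integer, roughly one token per short word
```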
Authenticate with OpenAI API and load the text prompt
You need a valid billing plan and a valid API key! The billing plan requires registering a credit card. The expense is not much unless you process many hundreds of papers.
```python
client = openai.OpenAI(api_key="sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
prompt = open("_Prompt AI.txt").read()
```
This initializes the OpenAI client with your API key and loads a pre-written prompt (stored in `_Prompt AI.txt`) that will be prepended to each PDF's full text.
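The content of the prompt file is not shown here; as a purely hypothetical example, a data-extraction prompt for a systematic review could look like this:

```python
# Hypothetical example of what "_Prompt AI.txt" could contain (not the actual file)
example_prompt = """You are assisting with data extraction for a systematic review.
From the article text that follows, report: (1) study design, (2) sample size,
(3) main outcome measures, (4) key findings.
Answer with one numbered line per item; write NA if an item is not reported."""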
Scan the current directory for all PDF files
```python
allfilenames = os.listdir()
filenames = [f for f in allfilenames if f.endswith(".pdf")]
articles = [os.path.splitext(f)[0] for f in filenames]
```
Initialize a dataframe to hold the results
```python
result = pd.DataFrame({"filename": filenames,
                       "article": articles,
                       "length": np.nan,
                       "output": ""})
```
Iterate through each PDF and extract information using the language model
Below is the core loop. At each iteration it:

- Saves progress in `_results.csv` (just in case everything crashes)
- Extracts text from the current PDF
- Computes the token count
- Sends the text to the OpenAI model (if within lower and upper length limits)
- Stores the model’s output
```python
for i in range(len(result)):
    try:
        result.to_csv("_results.csv", index=False)  # save intermediate results so far

        # Extract the full text of the current PDF
        pdfreader = PdfReader(result["filename"][i])
        fulltext = "\n".join(page.extract_text() or "" for page in pdfreader.pages).strip()

        # Estimate the token count for the full text
        textlength = len(tiktoken.encoding_for_model("gpt-4").encode(fulltext))
        result.loc[i, "length"] = textlength

        # Skip very short or very long documents, and rows already processed
        if textlength > 500 and textlength < 30000 and result.loc[i, "output"] == "":
            # the true "core": creates a gpt chat and launches the prompt + full text
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # choose the model, mind the price!
                messages=[
                    {"role": "system", "content": "You are a research assistant"},
                    {"role": "user", "content": prompt + fulltext}
                ],
                temperature=0.5
            )
            # store output text (replacing newlines for convenience)
            result.loc[i, "output"] = response.choices[0].message.content.replace("\n", "___")

        # print some notification as the iteration is completed
        print("------------- Done at row " + str(i) +
              ". File: " + result["filename"][i] + ". Length: " + str(textlength))
    except Exception as e:
        print("------------- Error at row " + str(i) + ": " + str(e))
```
Final: Just plot the distribution of token lengths across all PDFs
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(result["length"])
plt.show(); plt.close()
plt.clf()

# check the total token volume
result["length"].sum()
```
What could you do now?
- Do nothing, just be aware that this is a possibility;
- Just give it a try and see how it performs;
- Rewrite this for a simpler but important task, such as abstract screening (a minimal sketch follows below);
- Re-implement the approach using a free and reproducible model (e.g., LLaMA, Mistral, SciBERT), but be aware that high-quality models may be very large and not easy to run locally on ordinary computers.
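As an illustration of the abstract-screening option, here is a minimal sketch; the dataframe `abstracts`, its column names, and the yes/no output format are assumptions for illustration, not part of the original exercise:

```python
# Hypothetical abstract screening: ask the model for a yes/no inclusion decision.
# Assumes a dataframe `abstracts` with columns "id" and "abstract" (assumed names).
screening_prompt = ("Decide whether the following abstract meets the inclusion "
                    "criteria of our review: [state your criteria here]. "
                    "Answer only YES or NO, then one sentence of justification.\n\n")

decisions = []
for text in abstracts["abstract"]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a research assistant"},
            {"role": "user", "content": screening_prompt + text}
        ],
        temperature=0  # near-deterministic answers are preferable for screening
    )
    decisions.append(response.choices[0].message.content)

abstracts["decision"] = decisions
```

The structure is the same as the extraction loop above, only with a shorter input (the abstract instead of the full text) and a constrained output, which makes the results easier to tabulate and audit.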