Alexander Dunkel, Institute of Cartography, TU Dresden
This notebook is based on Summarizing Papers With Python and GPT-3.
ChatGPT can be used to summarize text. The post above shows how to do this for research papers. I am testing this here, with some modifications to the code, on this recent research paper:
Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation
GPT-4 - Version tested on March 15 with a number of papers.
To run this notebook, as a starting point, you can use the Carto-Lab Docker Container.
conda activate worker_env
conda install -c conda-forge openai pdfplumber
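The API key is loaded below with python-dotenv; if it is not already available in the environment (an assumption, it may come preinstalled with the container), it can be installed from conda-forge as well:
conda install -c conda-forge python-dotenv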
import requests
import pdfplumber
import openai
from pathlib import Path
def get_pdf_url(url, filename="random_paper.pdf"):
    """
    Get PDF from url
    """
    filepath = Path(filename)
    if not filepath.exists():
        response = requests.get(url, stream=True)
        filepath.write_bytes(response.content)
    return filepath
url: str = "https://www.mdpi.com/2220-9964/9/10/607/pdf?version=1605175173"
paper_name: str = "privacy_paper.pdf"
pdf_path = get_pdf_url(url, paper_name)
print(pdf_path)
paper_content = pdfplumber.open(pdf_path).pages
def display_page_content(paper_content, page_start=0, page_end=5):
    """Print the extracted text for a range of pages"""
    for page in paper_content[page_start:page_end]:
        print(page.extract_text(x_tolerance=1))
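For a quick check of the extraction quality, the helper can be called on the first page (illustrative call, not part of the original notebook):
display_page_content(paper_content, page_start=0, page_end=1)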
Load API key
import os
from pathlib import Path
from dotenv import load_dotenv
dotenv_path = Path.cwd().parent / '.env'
load_dotenv(dotenv_path, override=True)
API_KEY = os.getenv("OPENAI_API_KEY")
ORGANISATION = os.getenv("OPENAI_ORGANIZATION")
openai.organization = ORGANISATION
openai.api_key = API_KEY
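The .env file one folder above the notebook is expected to provide the two variables read above; a minimal example with placeholder values:
OPENAI_API_KEY=<your-api-key>
OPENAI_ORGANIZATION=<your-organization-id>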
Test
len(paper_content)
This is based on the OpenAI Example "Summarize Text". Also see the API Reference.
import warnings
def limit_tokens(str_text, limit: int = 2000) -> str:
    """Limit the number of words in a text"""
    wordlist = str_text.split()
    if len(wordlist) > limit:
        warnings.warn(
            f'Clipped {len(wordlist)-limit} words due to token length limit of {limit}.')
    return ' '.join(wordlist[:limit])
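A quick sanity check of the clipping behaviour (illustrative example):
sample_text = " ".join(["word"] * 2500)
clipped = limit_tokens(sample_text)
len(clipped.split())  # 2000, with a warning about the 500 clipped words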
def pdf_summary(paper_content, page_start=0, page_end=5):
    """Summarize a range of PDF pages with a Tl;dr completion"""
    engine_list = openai.Engine.list()
    # concatenate the extracted text for the selected page range
    text = ""
    for ix, page in enumerate(paper_content[page_start:page_end]):
        text = f"{text}\n{page.extract_text(x_tolerance=1)}"
    # clip to the word limit and append the Tl;dr prompt
    text = limit_tokens(text)
    task = f"{text}\n\nTl;dr"
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=task,
        temperature=0.7,
        max_tokens=500,
        top_p=1,
        presence_penalty=1,
        frequency_penalty=0.7,
        stop=["\nthe_end"]
    )
    return response
def get_pdf_summary(url, filename="random_paper.pdf", page_start=0, page_end=5):
    """Get PDF, if it doesn't exist locally, and summarize"""
    pdf_path = get_pdf_url(url, filename)
    paper_content = pdfplumber.open(pdf_path).pages
    answer = pdf_summary(paper_content, page_start=page_start, page_end=page_end)
    print(answer["choices"][0]["text"])
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
Test a range of pages in the middle of the paper:
get_pdf_summary(url, paper_name, page_start=4, page_end=8)
Test on Marc's paper:
url = "https://www.mdpi.com/2220-9964/12/2/60/pdf?version=1676279274"
paper_name = "loechner_privacy.pdf"
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
url = "https://cartogis.org/docs/autocarto/2022/docs/abstracts/Session9_Burghardt_0754.pdf"
paper_name = "dirk_ethics.pdf"
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
url = os.getenv("M_PAPER_URL")
paper_name = "madalina_ecosystemservices.pdf"
%%time
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
The model produces its best responses above 256 tokens, but it is limited to processing 4096 tokens per request (prompt and completion combined), which is not enough to read a full paper in context. Still, this seems like a good way to get a quick summary when skimming through many research papers.
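One possible workaround, not used in this notebook, is to summarize the paper in chunks and then summarize the partial summaries. The sketch below assumes the same (legacy) openai.Completion API and parameters as above; the helper names summarize_text and pdf_summary_chunked are hypothetical.
def summarize_text(text, max_tokens=500):
    """Single Tl;dr completion for one chunk of text (same parameters as pdf_summary)"""
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"{text}\n\nTl;dr",
        temperature=0.7,
        max_tokens=max_tokens,
        top_p=1,
        presence_penalty=1,
        frequency_penalty=0.7)
    return response["choices"][0]["text"].strip()

def pdf_summary_chunked(paper_content, pages_per_chunk=5):
    """Summarize groups of pages separately, then condense the partial summaries"""
    partial_summaries = []
    for start in range(0, len(paper_content), pages_per_chunk):
        chunk = "\n".join(
            page.extract_text(x_tolerance=1) or ""
            for page in paper_content[start:start + pages_per_chunk])
        partial_summaries.append(summarize_text(limit_tokens(chunk)))
    return summarize_text(limit_tokens("\n".join(partial_summaries)))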
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./gpt4-summary.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-