GPT-4 Summary for text via API

Alexander Dunkel, Institute of Cartography, TU Dresden


Last updated: Mar-16-2023, Carto-Lab Docker Version 0.12.3

This notebook is based on Summarizing Papers With Python and GPT-3.

ChatGPT can be used to summarize text. The post above shows how to do this for research papers. I am testing this here, with some modifications to the code, on this recent research paper:

Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation

DOI: 10.3390/ijgi9100607

GPT-4: this version was tested on March 15 with a number of papers.

Prepare environment

As a starting point for running this notebook, you can use the Carto-Lab Docker Container.

conda activate worker_env
conda install -c conda-forge openai pdfplumber
In [2]:
import requests
import pdfplumber
import openai
from pathlib import Path

Get file

In [3]:
def get_pdf_url(url, filename="random_paper.pdf"):
    """
    Download PDF from url, unless it already exists locally
    """
    filepath = Path(filename)
    if not filepath.exists():
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()  # fail early on HTTP errors
        filepath.write_bytes(response.content)
    return filepath
In [4]:
url: str = "https://www.mdpi.com/2220-9964/9/10/607/pdf?version=1605175173"
paper_name: str = "privacy_paper.pdf"
In [5]:
pdf_path = get_pdf_url(url, paper_name)
print(pdf_path)
privacy_paper.pdf

Convert PDF to text

In [6]:
paper_content = pdfplumber.open(pdf_path).pages

def display_page_content(paper_content, page_start=0, page_end=5):
    for page in paper_content[page_start:page_end]:
        print(page.extract_text(x_tolerance=1))
display_page_content(paper_content)

Use OpenAI API to summarize content

Load API key

In [7]:
import os
from pathlib import Path
from dotenv import load_dotenv

dotenv_path = Path.cwd().parent / '.env'
load_dotenv(dotenv_path, override=True)
API_KEY = os.getenv("OPENAI_API_KEY")
ORGANISATION = os.getenv("OPENAI_ORGANIZATION")
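
Both variables are read from a .env file in the parent folder. A minimal example of that file, with placeholder values, could look like this:

OPENAI_API_KEY=sk-...
OPENAI_ORGANIZATION=org-...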
In [8]:
openai.organization = ORGANISATION
openai.api_key = API_KEY

Test

openai.Engine.list()
In [9]:
len(paper_content)
Out[9]:
20

This is based on the OpenAI Example "Summarize Text". Also see the API Reference.

In [14]:
import warnings

def limit_tokens(str_text, limit: int = 2000) -> str:
    """Limit the number of words in a text (word count as a rough proxy for tokens)"""
    wordlist = str_text.split()
    if len(wordlist) > limit:
        warnings.warn(
            f'Clipped {len(wordlist)-limit} words due to token length limit of {limit}.')
    return ' '.join(wordlist[:limit])

def pdf_summary(paper_content, page_start=0, page_end=5):
    """Concatenate the text of the selected pages and request a summary"""
    text = ""
    for page in paper_content[page_start:page_end]:
        text = f"{text}\n{page.extract_text(x_tolerance=1)}"
    text = limit_tokens(text)
    # The trailing "Tl;dr" prompts the model to summarize the preceding text
    task = f"{text}\n\nTl;dr"
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=task,
        temperature=0.7,
        max_tokens=500,
        top_p=1,
        presence_penalty=1,
        frequency_penalty=0.7,
        stop=["\nthe_end"]
    )
    return response
  • max_tokens: The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. Most models have a context length of 2048 tokens (except for the newest models, which support 4096). A way to count prompt tokens exactly is sketched after this list.
  • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications, and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
  • presence_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
  • frequency_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
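
The limit_tokens function above clips the text by word count, which is only a rough approximation of the model's token count. For an exact check, the tiktoken package could be used; this is not part of the notebook, just a minimal sketch:

import tiktoken

def count_tokens(text: str, model: str = "text-davinci-003") -> int:
    """Count the exact number of tokens that `text` occupies for `model`"""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example: prompt tokens plus max_tokens (500) must stay within the
# 4097-token context window of text-davinci-003
prompt = "Some paper text...\n\nTl;dr"
print(count_tokens(prompt) + 500 <= 4097)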
In [15]:
def get_pdf_summary(url, filename="random_paper.pdf", page_start=0, page_end=5):
    """Get PDF, if it doesn't exist locally, and summarize"""
    pdf_path = get_pdf_url(url, filename)
    paper_content = pdfplumber.open(pdf_path).pages
    answer = pdf_summary(paper_content, page_start=page_start, page_end=page_end)
    print(answer["choices"][0]["text"])
In [16]:
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
/tmp/ipykernel_666/812843138.py:7: UserWarning: Clipped 9940 words due to token length limit of 2000.
  warnings.warn(

This paper presents a component-based approach to privacy-aware visualization of volunteered geographic information (VGI) for natural resource management. A key component is HyperLogLog (HLL), which is used to allow estimation of results, instead of more accurate measurements. HLL can be combined with existing approaches to improve privacy while also maintaining some flexibility in the analysis. Both the data processing pipeline and resulting dataset are made available, allowing transparent benchmarking of the privacy–utility tradeoffs. The paper provides an example use case demonstration based on a global, publicly-available dataset that contains 100 million photos shared by 581,099 users under Creative Commons licenses, illustrating how HLL may fill a gap in privacy-aware processing of user-generated data in natural resource management.

Test a range of pages in the middle of the paper:

In [17]:
get_pdf_summary(url, paper_name, page_start=4, page_end=8)
/tmp/ipykernel_666/812843138.py:7: UserWarning: Clipped 749 words due to token length limit of 2000.
  warnings.warn(

We present a system architecture for privacy-aware visual analytics of social media data, using HyperLogLog (HLL) to count distinct items. We combine four Docker containers to simulate the roles of an Analytics Service, such as an Aggregation Service and Sketching Service. Using the Yahoo Flickr Creative Commons 100 Million dataset as a demonstration example, we compare results between raw and HLL data and discuss two case studies in terms of internal and external adversaries. Benchmark results demonstrate that HLL provides similar performance compared to raw data for most metrics but with improved protection against privacy attacks.

Test with different papers

Test on Marc's paper:

In [19]:
url = "https://www.mdpi.com/2220-9964/12/2/60/pdf?version=1676279274"
paper_name = "loechner_privacy.pdf"
In [20]:
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
/tmp/ipykernel_666/812843138.py:7: UserWarning: Clipped 6655 words due to token length limit of 2000.
  warnings.warn(

This paper discusses the use of a method called HyperLogLog to store social media data in order to protect the privacy of its users. The technique is based on using cardinality estimation, which allows for unions and intersections on multiple datasets without becoming in possession of the actual raw data. A proof-of-concept implementation for this method is provided with an example disaster management scenario. This new form of data storage can be used to mitigate risks associated with large sets of personal data, such as abuse, loss or public exposure.
In [21]:
url = "https://cartogis.org/docs/autocarto/2022/docs/abstracts/Session9_Burghardt_0754.pdf"
paper_name = "dirk_ethics.pdf"
In [22]:
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
/tmp/ipykernel_666/812843138.py:7: UserWarning: Clipped 1042 words due to token length limit of 2000.
  warnings.warn(

This paper presents an ethical analysis of geosocial data to balance social and individual interests. It proposes the use of HyperLogLog, an algorithm that can break up geo-social media posts into quantitative, statistical information units called HLL sets. These HLL sets are heavily compressed and allow very high-performance queries over large amounts of data. To adjust privacy–utility tradeoffs, stop and allow lists as well as threshold values can be used during the creation of the HLL set to enable context-dependent data protection through filtering. The paper also discusses how different types of contexts (spatial, temporal, thematic and social) can be treated differently in order to protect user privacy while allowing flexibility for different applications.
In [23]:
url = os.getenv("M_PAPER_URL")
paper_name = "madalina_ecosystemservices.pdf"
In [24]:
%%time
get_pdf_summary(url, paper_name, page_start=0, page_end=20)
/tmp/ipykernel_666/812843138.py:7: UserWarning: Clipped 11728 words due to token length limit of 2000.
  warnings.warn(
: This study introduces a novel method for assessing the cultural ecosystem services (CES) provided by urban green spaces. The method draws on the semantic similarity between word2vec word embeddings to classify large volumes of geosocial media textual metadata and quantify indicators of CES use. We demonstrated the applicability of our approach by quantifying spatial patterns of aesthetic appreciation and wildlife recreation in the green spaces of Dresden, Germany based on >50,000 geotagged Instagram and Flickr posts. Additionally, we analyzed and mapped semantic patterns embedded in geosocial media data which can contribute toward a context-dependent assessment of CES use, helping inform decision making for more sustainable planning and management of urban ecosystems.
CPU times: user 6.27 s, sys: 151 ms, total: 6.42 s
Wall time: 13 s

Conclusions

ChatGPT produces its best summaries when the response is allowed more than 256 tokens, but the model is limited to processing 4096 tokens at once, which is not enough to read a full paper in context. Still, this seems like a good way to get a quick summary when skimming through many research papers.
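
One way around this context limit, not implemented here, would be a two-stage approach: summarize the paper in chunks of pages, then summarize the concatenated chunk summaries. A minimal sketch, reusing the functions defined above:

def pdf_summary_chunked(paper_content, chunk_size=5):
    """Summarize a paper chunk-wise, then summarize the partial summaries"""
    partial_summaries = []
    for start in range(0, len(paper_content), chunk_size):
        response = pdf_summary(
            paper_content, page_start=start, page_end=start + chunk_size)
        partial_summaries.append(response["choices"][0]["text"])
    # Second pass: feed the combined partial summaries back to the model
    combined = "\n".join(partial_summaries)
    task = f"{limit_tokens(combined)}\n\nTl;dr"
    response = openai.Completion.create(
        engine="text-davinci-003", prompt=task,
        temperature=0.7, max_tokens=500)
    return response["choices"][0]["text"]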

Create notebook HTML

In [2]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./gpt4-summary.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-