Background
Of course, jump ahead if you have already read enough about what GPT can do.
The GPT (Generative Pre-trained Transformer) is a powerful language model developed by OpenAI, which has revolutionised the field of natural language processing (NLP). It is a neural network-based system that has been pre-trained on a massive corpus of text data and is capable of generating high-quality text content, including articles, summaries, and responses to queries.
The first GPT model was released in 2018, and the family has since undergone several improvements and updates, including GPT-3, which was released in June 2020 and made available through the OpenAI API. This version has 175 billion parameters, making it one of the largest and most powerful language models ever created. It has been trained on a diverse range of text sources, including books, articles, and web pages, and has been shown to outperform previous language models on a variety of NLP tasks.
One of the most exciting features of the GPT API is its ability to generate coherent and concise summaries of long texts. This capability has numerous applications, including in the field of document analysis, where it can be used quickly and efficiently to summarise large volumes of text data.
Objective
In this project, we will explore how to leverage the power of the GPT API in Python to automate the summarisation of PDF documents, providing a useful tool for researchers, analysts, and other professionals who work with large amounts of textual data.
High Level Approach
Three simple high level steps only:
- Fetch a sample document from the internet, or create one by saving a Word document as PDF.
- Use Python's PyPDF2 library to extract the text.
- Call the GPT API to summarise it with an appropriate prompt (e.g. "summarise for a 5 year old", "list the top 5 main themes", etc.)
Step-by-Step
1.0 Downloading a sample PDF
To keep the example simple but realistic, we will use an article published by UBS Wealth Management called “Let’s chat about ChatGPT”. The linked page contains the published article as well as an option, at the bottom, to download it as a PDF.
Reading the document takes 3–5 minutes, as it is a viewpoint piece rather than a scientific paper. Do read it so you can later judge whether the summary generated by GPT is in line with your expectations.
2.0 Extract the text using PyPDF2 library
2.1 Install PyPDF2
The “free and open source pure-python” PDF library PyPDF2 can be installed with a simple `pip install PyPDF2`.
2.2 Write function to extract the text from PDF
```python
from PyPDF2 import PdfReader

# Read the PDF from start_page to final_page (inclusive).
# If the document has fewer pages, read up to its last page.
def get_pdf_text(document_path, start_page=1, final_page=999):
    reader = PdfReader(document_path)
    number_of_pages = len(reader.pages)
    text = ''
    for page_num in range(start_page - 1, min(number_of_pages, final_page)):
        text += reader.pages[page_num].extract_text()
    return text
```
3.0 Invoke GPT API to get the summarisation
3.1 Get the GPT API Key
- Log in to the OpenAI platform with your account.
- Click “Create new secret key”
- Copy the key by clicking on the copy button.
- Set the environment variable (open a Terminal window on Mac and type the following command):

```shell
export OPENAI_API_KEY=key_copied_from_openai_site
```
3.2 Write function to call the API
Hint: Playing around with the hyper-parameters generates different responses. Temperature is an especially fun parameter to tinker with.
```python
import os

import openai

openai.api_key = os.getenv('OPENAI_API_KEY')

def gpt_req_res(subject_text='write an essay on any subject.',
                prompt_base='answer like an experienced consultant: ',
                model='text-davinci-003',
                max_tokens=1200,
                temperature=0.8):
    # https://platform.openai.com/docs/api-reference/completions/create
    response = openai.Completion.create(
        model=model,
        prompt=prompt_base + subject_text,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].text
```
3.3 Create the main function to extract text from the PDF and call the GPT API with the extracted text
```python
doc_path_name = 'documents/chat_gpt_ubs.pdf'
doc_text = get_pdf_text(doc_path_name, 1, 2)
# print(doc_text)

prompt = 'summarize like an experienced consultant in 5 bullets: '
reply = gpt_req_res(doc_text, prompt)
print(reply)
```
And it’s done ….
….Wait…. let’s check the output
Run-1
- ChatGPT-3 is a chatbot developed by OpenAI, a US-based artificial intelligence research lab.
- We view ChatGPT-3 as the current leader in a fast growing market that will see significant investment and development by leading, large technology companies globally.
- Large language models are compelling because of their flexibility and can be used in a variety of applications across multiple markets.
- We believe artificial intelligence (AI) will ultimately be additive to employment and economic growth.
- The broad AI hardware and services market are expected to reach USD 90bn by 2025, with ChatGPT’s addressable market estimated at USD 18–20bn.
Run-2
- ChatGPT-3 is a chatbot developed by OpenAI, a US-based artificial intelligence research lab. It uses a generative pre-trained transformer (GPT) to generate text.
- ChatGPT-3 has potential use cases including chatbots for customer service and mental health support, personal assistants, content creation, language translation, knowledge management, and education/training.
- AI is expected to be additive to employment and economic growth.
- The AI hardware and services market was nearly USD 36bn in 2020 and is expected to grow to USD 90bn by 2025.
- Investors can consider opportunities in public equities such as semiconductor companies, and cloud-service providers, and private equity (PE).
Very impressive summarisation, especially the extraction of sentences that are actionable or worth further discussion.
Summary
- We can use this method to get summaries from PDFs very quickly.
- It can be scaled to run over a large number of PDFs, storing the summaries and other meta information in a searchable database.
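As a rough illustration of the second point, here is a minimal sketch of a batch runner. It is an assumption on my part, not part of the original project: `summarise_folder` and its `summarise` callable are hypothetical names, and the callable stands in for the `get_pdf_text` + `gpt_req_res` pipeline built above. Each result is stored as one JSON record per line, a simple format to load into a searchable store later.

```python
import json
from pathlib import Path
from typing import Callable

def summarise_folder(folder: str, summarise: Callable[[str], str],
                     out_path: str = 'summaries.jsonl') -> int:
    """Summarise every PDF in `folder` and write one JSON record per file.

    `summarise` maps a document path to a summary string, e.g.
    lambda p: gpt_req_res(get_pdf_text(p, 1, 2), prompt)
    using the functions defined in the steps above (hypothetical wiring).
    Returns the number of documents processed.
    """
    records = []
    for pdf in sorted(Path(folder).glob('*.pdf')):
        records.append({
            'file': pdf.name,
            'size_bytes': pdf.stat().st_size,
            'summary': summarise(str(pdf)),
        })
    # JSON Lines: one record per line, easy to index or bulk-load later
    with open(out_path, 'w') as f:
        for rec in records:
            f.write(json.dumps(rec) + '\n')
    return len(records)
```

Swapping the JSONL file for a proper database or search index is then a local change to the write step only.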
Limitations
The ‘free’ version of the API can only process text up to a limited size. If you provide a longer text, openai throws an error: ‘openai.error.InvalidRequestError: This model’s maximum context length is 4097 tokens, however you requested 6327 tokens (5127 in your prompt; 1200 for the completion). Please reduce your prompt; or completion length.’
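One common workaround is to split the extracted text into chunks that fit the context window, summarise each chunk, and then summarise the concatenated summaries. Below is a minimal sketch of the splitting step; `split_into_chunks` is a hypothetical helper, and the ~4 characters per token figure is only a rough rule of thumb (the true count depends on the model's tokenizer).

```python
def split_into_chunks(text, max_tokens=2500, chars_per_token=4):
    """Split text on line boundaries so each chunk stays under a rough
    token budget, approximated as max_tokens * chars_per_token characters.
    A single line longer than the budget becomes its own oversized chunk."""
    budget = max_tokens * chars_per_token
    chunks, current = [], ''
    for para in text.split('\n'):
        candidate = (current + '\n' + para) if current else para
        if len(candidate) > budget and current:
            chunks.append(current)  # close the current chunk
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the API call from step 3.2, and the per-chunk summaries combined in a final summarisation pass.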
Things to think about
- How do we perform the same task when we have sensitive data and don't want it processed by a platform like OpenAI, i.e. what on-premise solutions are available?
- What would a good searchable repository of information look like?
Outro
- Watch out for more articles related to the first point: an evaluation of non-API-based options for performing the same task.
- Github Link