Document summarizer using OpenAI with LangChain

As a use case, this example summarizes a resume. I used Google Colab for this experiment, but you can use your own IDE or environment. Just make sure you have the necessary prerequisites in place.

  1. Since I am using Google Colab, I will upload the sample input file to the “Files” store. You can use your local disk storage if you are on a laptop/PC.
  2. You can use any file format, but my input is a PDF file, so it has to be converted to readable text. I will use the pdfx library to read and extract the text.
  3. Define a meaningful prompt and set the context.
  4. Call the OpenAI API.
  5. Receive the response and show the summarized text.

Below are the instructions and code:

  1. Install Prerequisites

I am using the pdfx library to read the PDF document. You can use any PDF library here.

pip install pdfx

We call OpenAI through LangChain, so install the required dependencies:

pip install --upgrade langchain langchain-openai tiktoken

2. Load the resume

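import pdfx

# Path to the sample resume uploaded to the Colab "Files" store
pdf = pdfx.PDFx('sample_data/Sample Resume.pdf')

# Extract the document's text as a single string
resume_content = pdf.get_text()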

3. Make the resume content compatible for LLM Chain

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
model_name = "gpt-3.5-turbo"
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    model_name=model_name
)

# Caution: the chunks are later combined into a single prompt, so handling documents too large for the model's context window is out of scope for this example
texts = text_splitter.split_text(resume_content)

docs = [Document(page_content=t) for t in texts]
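
To make the splitting step less of a black box, here is a minimal pure-Python sketch of the general idea: split on a separator, then greedily merge the pieces until a size budget is reached. It is not LangChain's implementation. The helper split_into_chunks, the max_chars parameter and the sample text are made up for illustration, and it measures length in characters, whereas from_tiktoken_encoder measures it in tokens.

```python
def split_into_chunks(text, max_chars=200, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most max_chars."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= max_chars or not current:
            # The piece fits, or it is oversized on its own and is kept whole.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Made-up sample text standing in for extracted resume content.
sample = "Summary\n\nExperience at a lab\n\nSkills: Python, ML"
chunks = split_into_chunks(sample, max_chars=30)
print(chunks)
```

With a 30-character budget the first two sections end up in one chunk and the third in its own chunk.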

4. Initialize OpenAI

from langchain_openai import ChatOpenAI
from google.colab import userdata


# The OpenAI API key is stored in the Secrets vault in Google Colab
OPENAI_API_KEY = userdata.get('openai_api_key')

llm = ChatOpenAI(
    temperature=0,
    openai_api_key=OPENAI_API_KEY,
    model_name=model_name)
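
If you are not on Colab, the userdata module will not be available. One common alternative is an environment variable. The sketch below assumes the key is stored in OPENAI_API_KEY, and load_api_key is a hypothetical helper of my own, not part of any library.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, failing early if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key


# Demo with a stand-in mapping so the snippet runs without a real key.
demo_key = load_api_key({"OPENAI_API_KEY": "sk-placeholder"})
print(demo_key)
```

The returned value can then be passed as openai_api_key in place of the Colab lookup.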

5. Define summarization prompt

from langchain.prompts import PromptTemplate

# To list only the skills, use the prompt "List the skills mentioned in below resume:" instead

prompt_template = """Summarize below resume:

{text}

"""

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

6. Summarization

from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=prompt)

# All chunks are stuffed into a single prompt; handling of large documents is ignored here
chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")
summary = chain.run(docs)
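
Conceptually, the "stuff" approach just joins every chunk into one string and drops it into the {text} slot of the prompt. The sketch below imitates that in plain Python. The helper stuff_prompt is hypothetical, the blank-line separator is an assumption, the two chunks are made-up sample data, and no model is called.

```python
def stuff_prompt(template, chunks, separator="\n\n"):
    """Join all chunks and substitute them into the template's {text} slot."""
    return template.format(text=separator.join(chunks))


template = "Summarize below resume:\n\n{text}\n\n"
# Made-up chunks standing in for the split resume text.
chunks = ["Jane Doe, data scientist.", "Skills: Python, SQL."]
final_prompt = stuff_prompt(template, chunks)
print(final_prompt)
```

Because everything lands in a single request, the combined text has to fit within the model's context window. Longer documents need a different strategy, which this example does not cover.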

7. Print Summary

import textwrap
print(textwrap.fill(summary, width=100))

Example output: <name-removed> is a passionate researcher with a focus on cutting-edge technology such as Machine Learning, Computer Vision, and Deep Learning. He has experience as an Associate Data Scientist-Trainee at Lincode Labs and as an AI/ML Intern. His roles included data collection and cleaning, extending code modules, experimenting and deploying machine/deep learning models, and handling end-to-end processes. He has worked on various projects related to object detection, OCR detection, and classification in the manufacturing domain. <name-removed> has a Bachelor’s degree in Computer Science and skills in Python, machine learning platforms, frameworks, libraries, and tools. He has also worked on academic and personal projects related to border security systems and house price prediction.

Angular-.NET knowledge refresh

Ok, last week I was on an Angular+.NET Core learning spree, and I had to continue since my goals were not met.

  • Monday – Angular foundation
  • Tuesday – Angular foundation cont. – services, .NET Core Web API setup, EF Core
  • Wednesday – continued the same… had issues to debug
  • Thursday – replaced Bootstrap with ng-bootstrap, for no reason
  • Friday – continued the ng-bootstrap wars, and was unhappy. Decided to learn Angular Material after reading UI library comparisons
  • Saturday, Today – continuing with Angular Material

Wavelength of thoughts

Have you ever wondered why certain personalities have people around them who like them and others who hate them? Here is my thought… it's the wavelength difference.

Some people might consider you a genius, but to others you might be a fool, or a person who loves to ask stupid questions. I observe people and usually try to stand in the other person's shoes to find out why a particular person is the way they are. This guy might be thinking much faster, or slower, than you, buddy!

When we talk in discussions, we tend to think, analyse and articulate the next statement in our mind, which is a (semi) parallel activity. Your pace of thought and your articulation skills matter. The other person might be a fast thinker on the topic under discussion: while you are at point A, he might already be at point D, so he might be talking about things that are highly relevant to him while you have not even reached point B, and you end up with synchronization problems. The reverse can also happen, and it's nobody's fault. Give more importance to your listening skills so you can catch up to his speed and make sense of what he says.

Web 3.0 is here…

While reading about Web3, I was initially reminded, for no particular reason, of the decentralized internet concept discussed in the famous Silicon Valley television series. So I thought I would write about what I learned from reading up on this buzz topic. My love for the internet started with the sound of dial-up, so I was lucky enough to experience different browsers, sockets, chat apps, HTML, compatibilities and incompatibilities, and various phases of web standards. Initially it was purely technology focused, but it has now become human focused, thanks to evolving Customer Experience (CX), the priorities of design thinkers, and an involuntary digital revolution.

Web3 is based on Blockchain

Specifically, on its decentralization, which is the major upgrade over Web 2.0. Data volumes, importance and complexity have risen to uncontrollable levels, and it has become the need of the hour for enterprises to keep a log book of what is happening with their data. Systems today are forced to check that you are a human being and not a bot before allowing you to do any transaction: Captcha, Multi-Factor Authentication (MFA), Single Sign-On, face recognition, and what not. Companies are investing hugely to protect their data by securing their systems, and until now we were limited to traditional authentication and role-based authorization as the enterprise standard. For many years, sensitive industries such as banking were reluctant to use cloud systems because of trust issues, and cloud vendors such as Microsoft Azure and Amazon Web Services had a hard time selling their products to enterprises. Consumers still fear that Alexa is listening to their conversations, and industries are concerned about data leakage from their IoT devices. Web 3.0, with the proven architecture of crypto designs such as blockchain, is going to significantly change this outlook, and on-premise systems will soon move completely to a decentralized mode.
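
The "log book" idea can be illustrated with a hash chain, the basic data structure underneath a blockchain. This is only a toy sketch: it runs on one machine, has no consensus or network layer, and names such as add_entry and is_valid are invented for the example.

```python
import hashlib
import json


def entry_hash(data, prev_hash):
    """Hash an entry's data together with the previous entry's hash."""
    payload = json.dumps({"data": data, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_entry(log, data):
    """Append an entry linked to the hash of the one before it."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    log.append({"data": data, "prev": prev_hash, "hash": entry_hash(data, prev_hash)})


def is_valid(log):
    """Recompute every hash and link; any edited entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        expected = entry_hash(entry["data"], prev_hash)
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


log = []
add_entry(log, "record created")
add_entry(log, "record read by service A")
print(is_valid(log))   # True

log[0]["data"] = "record deleted"   # tamper with history
print(is_valid(log))   # False
```

Each entry's hash depends on the one before it, so rewriting an old entry is detectable unless every later hash is recomputed as well. A real blockchain adds distribution and consensus on top of this so that no single party can quietly redo the chain.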

It’s intelligent

Web 3.0 is expected to be more intelligent than the previous generations because we have now figured out what the internet can do and what we want from it. We have advanced a lot (though we are still immature in many areas) in data science, analytics and prediction, so it is time for the systems we build to have these learnings available ‘by design’.

Is it just a hype?

No, it cannot be. The concepts do not propose anything unrealistic; rather, they address the need of the hour.

More reading

Visit https://web3.foundation/ to read more. The technology stack page is very interesting.