Chat with Data from a Website using LangChain and OpenAI

In this tutorial, we will build a simple Q&A system that answers questions using information retrieved from a specific Wikipedia page. The system combines LangChain, OpenAI's GPT-3.5-turbo, and LangChain Hub. By the end of this guide, you'll have a working example of how these tools fit together in a practical application.

Prerequisites

Before we start, ensure you have the following installed:

  • Python 3.11 or higher

  • pip (Python package installer)

You will also need to install the necessary dependencies. Open your terminal and run the following command:

pip install langchain openai langchain-community chromadb tiktoken langchainhub beautifulsoup4

Additionally, you need to set the OPENAI_API_KEY environment variable to authenticate with the OpenAI API. You can do this by running the following command in your terminal:

export OPENAI_API_KEY='your-openai-api-key'

Replace 'your-openai-api-key' with your actual OpenAI API key.
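
A missing key usually surfaces only when the first API call fails, so it can help to check for it at the top of the script. The following is a minimal sketch; the helper name require_api_key is hypothetical and is not part of LangChain or the OpenAI SDK.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of the environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the script.")
    return value
```

Calling require_api_key() before building the chain turns a late authentication error into an immediate, readable one.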

Step 1: Setting Up the Project

First, let's import the required modules. The document loader itself is created in the next step.

from langchain import hub
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

Step 2: Loading and Splitting the Document

We will load a Wikipedia page using WebBaseLoader. For this tutorial, we'll use the Wikipedia page of Korean actress Kim Ji-won. After loading the data, we will split the document into smaller chunks to make it easier for our model to process.

# You can add multiple URLs here
loader = WebBaseLoader(["https://en.wikipedia.org/wiki/Kim_Ji-won_(actress)"])
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
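
To build intuition for chunk_size and chunk_overlap, here is a simplified sketch of fixed-size chunking in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and newlines); split_fixed is a hypothetical helper used only for illustration.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # no overlap
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # 2 shared characters
```

With chunk_overlap=0, as used in this tutorial, a sentence that straddles a boundary ends up divided between two chunks. A small overlap is one way to reduce that effect, at the cost of storing more text.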

Step 3: Creating the Vector Store

Next, we need to convert the text chunks into embeddings using OpenAI's embeddings model and store these embeddings in a vector store. We'll use Chroma as our vector store.

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
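
Conceptually, a vector store answers a query by embedding it and returning the chunks whose vectors are closest to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The vectors and the top_k helper are made up for illustration; real OpenAI embeddings have far more dimensions, and Chroma's index and distance metric may differ from this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "embeddings": each chunk is paired with a made-up vector.
toy_store = [
    ("chunk about early life", [0.9, 0.1, 0.0]),
    ("chunk about television roles", [0.1, 0.9, 0.2]),
    ("chunk about awards", [0.0, 0.3, 0.9]),
]
toy_query = [0.2, 0.8, 0.1]  # pretend this is the embedded question
print(top_k(toy_query, toy_store, k=2))
```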

Step 4: Setting Up the Prompt

We'll use a pre-defined prompt from LangChain Hub. Prompts are essential as they guide the model on how to structure its responses.

prompt = hub.pull("donvito-codes/rag-prompt")

The prompt is stored in LangChain Hub under the name donvito-codes/rag-prompt.
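
For orientation, a retrieval prompt typically has two slots: one for the retrieved text and one for the user's question. The template below is a hypothetical stand-in written with plain string formatting, not the actual contents of donvito-codes/rag-prompt. The placeholder names context and question are assumptions, chosen because they are the variable names I would expect a prompt for this kind of chain to use.

```python
# Hypothetical template for illustration; NOT the prompt stored in LangChain Hub.
HYPOTHETICAL_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and fill both slots of the template."""
    context = "\n\n".join(chunks)
    return HYPOTHETICAL_TEMPLATE.format(context=context, question=question)


print(build_prompt(["First retrieved chunk.", "Second retrieved chunk."], "What is this page about?"))
```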

Step 5: Initializing the Language Model

We will use OpenAI's GPT-3.5-turbo model for our Q&A system. The temperature parameter controls the randomness of the model's responses. A lower temperature means more deterministic outputs.

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
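
To see why a lower temperature makes outputs more deterministic, consider how temperature rescales the scores a model assigns to candidate tokens before one is sampled. The sketch below applies a temperature-scaled softmax to three made-up scores. It illustrates the general idea and is not OpenAI's implementation. A temperature of exactly 0 would divide by zero here, so the example uses small positive values instead.

```python
import math


def softmax_with_temperature(scores: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(toy_scores, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature shrinks, the probability of the highest-scoring candidate approaches 1, which is why repeated runs tend to produce the same text.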

Step 6: Creating the RetrievalQA Chain

We will create a RetrievalQA chain using the language model and the vector store retriever. The chain_type_kwargs parameter allows us to pass additional settings to the chain, such as the prompt.

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)
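
Roughly, the chain performs three steps for each query: fetch relevant chunks from the retriever, insert them into the prompt, and send the filled prompt to the model. The sketch below mimics that flow with plain functions. The names answer, fake_retriever, fake_llm, and toy_template are hypothetical stand-ins and do not reflect RetrievalQA's internal code.

```python
from typing import Callable


def answer(
    query: str,
    retriever: Callable[[str], list[str]],
    llm: Callable[[str], str],
    template: str,
) -> dict:
    """Retrieve chunks, fill the prompt, call the model, and return the result."""
    chunks = retriever(query)
    prompt_text = template.format(context="\n\n".join(chunks), question=query)
    return {"query": query, "result": llm(prompt_text)}


def fake_retriever(query: str) -> list[str]:
    # Stand-in for vectorstore.as_retriever(): always returns the same chunks.
    return ["Chunk one.", "Chunk two."]


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the chat model: reports the prompt length instead of answering.
    return f"(model received {len(prompt_text)} characters)"


toy_template = "Context:\n{context}\n\nQuestion: {question}"
print(answer("What is this page about?", fake_retriever, fake_llm, toy_template))
```

Because every retrieved chunk is placed into a single prompt, the number and size of the chunks determine how much of the model's context window each query consumes.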

Step 7: Asking a Question

Now, we can ask a question and get an answer from our Q&A system. For this example, we'll ask about Kim Ji-won's latest TV series.

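question = "What is the latest TV series Kim Ji-won starred in as of 2024?"
# invoke() runs the chain; calling the chain object directly is deprecated in recent LangChain releases
result = qa_chain.invoke({"query": question})
print(result["result"])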

Here's the full working source code:

from langchain import hub
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader(["https://en.wikipedia.org/wiki/Kim_Ji-won_(actress)"])
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

prompt = hub.pull("donvito-codes/rag-prompt")

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Build the RetrievalQA chain with the prompt pulled from LangChain Hub
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)

question = "What is the latest TV series Kim Ji-won starred in as of 2024?"
result = qa_chain.invoke({"query": question})
print(result["result"])

Save the file as main.py and run it:

python main.py

You've now created a basic Q&A system that pulls data from a Wikipedia page, converts it into embeddings, retrieves the most relevant chunks, and uses a language model to answer questions from them. It shows how LangChain and OpenAI can be combined to build applications that answer questions from a source you choose.

Feel free to experiment with different Wikipedia pages, questions, and prompts to see how the system performs. The flexibility of LangChain and OpenAI allows for a wide range of applications beyond just Q&A systems. Happy coding!

If you're interested in learning more about developing with Generative AI, subscribe to my blog for more tutorials and sample code.