Exploring LangChain's Quickstart (2) - Extending LLM knowledge

In this series, we’ll explore the ‘Quickstart’ section of the LangChain documentation.
In this article, we discuss how to expand LLM knowledge using information on the internet.

Below is the code from our previous article that we’ll reuse in this one.

import os
from langchain_openai import ChatOpenAI

# Read the API key from a file and set it as an environment variable
with open('.openai') as f:
    os.environ['OPENAI_API_KEY'] = f.read().strip()

# Load the large language model
llm = ChatOpenAI()

As we saw in the previous article, the gpt-3.5-turbo model we’ve been using has outdated knowledge and fails to answer “What is LangChain?” correctly.

However, LangChain includes a mechanism to fetch and use information from websites, enabling accurate responses to such questions.

In LangChain, text data is managed as objects called “documents”. This section covers how to turn website data into documents.

LangChain’s web loader uses BeautifulSoup to parse the pages it fetches. Install it using the following command:

pip install beautifulsoup4

To fetch web content, use WebBaseLoader. Pass the URL to its constructor, then call the load method, which returns a list of Document objects:

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://python.langchain.com/docs/get_started/introduction")
docs = loader.load()
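
To check what was loaded, you can inspect the returned list. A minimal sketch (the exact output depends on the page contents at the time you fetch it):

# Inspect the loaded documents
print(len(docs))                   # number of Document objects (one per URL here)
print(docs[0].metadata["title"])   # e.g. 'Introduction | 🦜️🔗 LangChain'
print(docs[0].page_content[:200])  # first 200 characters of the page text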

When the content is large, it’s often impractical or costly to process all of it directly through the LLM. Instead, a technique called RAG (Retrieval-Augmented Generation) is used: the content is pre-processed so that only the sections relevant to the question are handed to the model. The process follows these steps:

  1. Split the content into LLM-manageable document lengths, and use Embeddings to convert each document into a vector.
  2. Save these vectors in a database known as a vector store, creating an efficient search base.
  3. Convert user input into a vector and search the vector store for highly relevant vectors. This retrieves a list of documents related to the user’s query.
  4. Input the user’s query and the retrieved document list into the LLM to generate an answer.

First, prepare an embeddings model to convert documents into vectors. Here we use OpenAI’s embeddings; the default model is text-embedding-ada-002.

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
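
If you’re curious what these vectors look like, you can embed a string directly with the embed_query method (the exact values will differ from run to run):

# Embed a query string and inspect the resulting vector
vec = embeddings.embed_query("What is LangChain?")
print(len(vec))  # 1536 dimensions for text-embedding-ada-002
print(vec[:3])   # first few components of the vector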

Next, create a vector store to save the vectors. First, install the faiss-cpu library, which provides a local vector store:

pip install faiss-cpu

After installation, create the vector store by saving vector data as shown below:

from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Splitting documents
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)

# Vectorizing split documents and creating a vector store
vector = FAISS.from_documents(documents, embeddings)
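
Before wrapping the vector store in a retriever, you can query it directly. Its similarity_search method embeds the query and returns the most similar chunks (k defaults to 4):

# Search the vector store directly for the 2 most similar chunks
for doc in vector.similarity_search("What is LangChain?", k=2):
    print(doc.page_content[:100])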

Next, prepare a mechanism that, given user input, fetches the relevant documents from the vector store. This mechanism is called a retriever.

retriever = vector.as_retriever()
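
By default the retriever returns the top few matches. If you want to control how many documents come back, as_retriever accepts search parameters, for example (retriever_top3 is just an illustrative name):

# Optional: a retriever that returns only the 3 most relevant chunks
retriever_top3 = vector.as_retriever(search_kwargs={"k": 3})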

The retriever’s invoke method takes a string input and returns a list of related documents:

retriever.invoke("What is LangChain?")
  • Execution Result

    [Document(page_content='Introduction | 🦜️🔗 LangChain', metadata={'source': 'https://python.langchain.com/docs/get_started/introduction', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'}),
     Document(page_content="Skip to main contentComponentsIntegrationsGuidesAPI ReferenceMorePeopleVersioningContributingTemplatesCookbooksTutorialsYouTube🦜️🔗LangSmithLangSmith DocsLangServe GitHubTemplates GitHubTemplates HubLangChain HubJS/TS Docs💬SearchGet startedIntroductionQuickstartInstallationUse casesQ&A with RAGExtracting structured outputChatbotsTool use and agentsQuery analysisQ&A over SQL + CSVMoreExpression LanguageGet startedRunnable interfacePrimitivesAdvantages of LCELStreamingAdd message history (memory)MoreEcosystem🦜🛠️ LangSmith🦜🕸️LangGraph🦜️🏓 LangServeSecurityGet startedOn this pageIntroductionLangChain is a framework for developing applications powered by large language models (LLMs).LangChain simplifies every stage of the LLM application lifecycle:Development: Build your applications using LangChain's open-source building blocks and components. Hit the ground running using third-party integrations and Templates.Productionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.Deployment: Turn any chain into an API with LangServe.Concretely, the framework consists of the following open-source libraries:langchain-core: Base abstractions and LangChain Expression Language.langchain-community: Third party integrations.Partner packages (e.g. langchain-openai, langchain-anthropic, etc.): Some integrations have been further split into their own lightweight packages that only depend on langchain-core.langchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.langgraph: Build robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.langserve: Deploy LangChain chains as REST APIs.The broader ecosystem includes:LangSmith: A developer platform that lets you debug, test, evaluate, and monitor LLM applications and seamlessly integrates with LangChain.Get started\u200bWe recommend following our Quickstart guide to familiarize yourself with the framework by building your first LangChain application.See here for instructions on how to install LangChain, set up your environment, and start building.noteThese docs focus on the Python LangChain library. Head here for docs on the JavaScript LangChain library.Use cases\u200bIf you're looking to build something specific or are more of a hands-on learner, check out our use-cases.", metadata={'source': 'https://python.langchain.com/docs/get_started/introduction', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'}),
     Document(page_content="They're walkthroughs and techniques for common end-to-end tasks, such as:Question answering with RAGExtracting structured outputChatbotsand more!Expression Language\u200bLangChain Expression Language (LCEL) is the foundation of many of LangChain's components, and is a declarative way to compose chains. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains.Get started: LCEL and its benefitsRunnable interface: The standard interface for LCEL objectsPrimitives: More on the primitives LCEL includesand more!Ecosystem\u200b🦜🛠️ LangSmith\u200bTrace and evaluate your language model applications and intelligent agents to help you move from prototype to production.🦜🕸️ LangGraph\u200bBuild stateful, multi-actor applications with LLMs, built on top of (and intended to be used with) LangChain primitives.🦜🏓 LangServe\u200bDeploy LangChain runnables and chains as REST APIs.Security\u200bRead up on our Security best practices to make sure you're developing safely with LangChain.Additional resources\u200bComponents\u200bLangChain provides standard, extendable interfaces and integrations for many different components, including:Integrations\u200bLangChain is part of a rich ecosystem of tools that integrate with our framework and build on top of it. Check out our growing list of integrations.Guides\u200bBest practices for developing with LangChain.API reference\u200bHead to the reference section for full documentation of all classes and methods in the LangChain and LangChain Experimental Python packages.Contributing\u200bCheck out the developer's guide for guidelines on contributing and help getting your dev environment set up.Help us out by providing feedback on this documentation page:NextIntroductionGet startedUse casesExpression LanguageEcosystem🦜🛠️ LangSmith🦜🕸️ LangGraph🦜🏓 LangServeSecurityAdditional resourcesComponentsIntegrationsGuidesAPI referenceContributingCommunityDiscordTwitterGitHubPythonJS/TSMoreHomepageBlogYouTubeCopyright © 2024 LangChain, Inc.", metadata={'source': 'https://python.langchain.com/docs/get_started/introduction', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'})]
    

Next, create a chain that generates an answer from the user’s input and a list of documents containing the relevant information.

First, we will create a prompt template. In this template, we designate where to insert the user’s input with {input}, and where to place the list of documents with {context}.

from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_template("""Please answer the following question based solely on the given context:

<context>
{context}
</context>

Question: {input}""")

Use create_stuff_documents_chain to create a chain that generates answers from the user input and document list. The name refers to the “stuff” strategy: all the documents are stuffed into the prompt’s {context} at once:

from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(llm, template)

Use the created chain’s invoke method as shown below:

from langchain_core.documents import Document

document_chain.invoke({
    "input": "What is LangChain?",
    "context": [Document(page_content="LangChain is installed with `pip install langchain`.")]
})
  • Execution Result

    'LangChain is a software or package that can be installed using pip.'
    

By combining the retriever with the document chain, we can generate answers that reference information from websites.

user_input = "What is LangChain?"
context = retriever.invoke(user_input)
document_chain.invoke({
    "input": user_input,
    "context": context
})
  • Execution Result

    'LangChain is a framework for developing applications powered by large language models (LLMs). It simplifies every stage of the LLM application lifecycle, including development, productionization, and deployment. It consists of open-source libraries such as langchain-core, langchain-community, langchain, langgraph, and langserve. LangChain also includes partner packages for third party integrations.'
    

So far, we’ve created and discussed how to use the following:

  • retriever: Returns a list of documents related to the input string.
  • document_chain: Generates LLM responses based on the user’s question and document list.

Finally, combine these two into a single chain that answers user questions by referencing documents. create_retrieval_chain does exactly this: it runs the retriever on the user’s input and passes the retrieved documents to the document chain:

from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(retriever, document_chain)

Try using the created chain. You can see that an answer is generated by referencing documents:

response = retrieval_chain.invoke({"input": "What is LangChain?"})
print(response["answer"])
  • Execution Result

    LangChain is a framework for developing applications powered by large language models (LLMs). It simplifies every stage of the LLM application lifecycle, including development, productionization, and deployment. It consists of open-source libraries, third-party integrations, and partner packages aimed at building cognitive architectures for applications. LangChain also includes components like LangSmith, LangGraph, and LangServe within its broader ecosystem.
    

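Incidentally, the response dictionary returned by the retrieval chain also contains the retrieved documents under the "context" key, which is handy for showing sources:

# The retrieved documents are also available in the response
for doc in response["context"]:
    print(doc.metadata["source"])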