
Link to repo: LLM repo. Check it out for dependencies, the Python version, and all other technical details. This page is meant as a mostly high-level walkthrough.

Learn how to build a very simple RAG that retrieves information from a folder, with any LLM your computing power allows.

We will deploy a simple RAG + LLM in Python. I will show the most important bits of code, using langchain. I won’t go into detail about the actual containerized deployment.

Once deployed, we will ask a question about a specific scientific paper: “An AI Chatbot for Explaining Deep Reinforcement Learning Decisions of Service-oriented Systems”.

Specifically, we will ask a question whose answer is in the abstract, which reads:

“Deep Reinforcement Learning (DeepRL) is increasingly used to cope with the open-world assumption in service-oriented systems.”

We will ask the LLM whether it is true that DeepRL is increasingly used to cope with the open-world assumption in service-oriented systems.

Very simple. We will first ask the plain LLM, then the LLM with RAG, which has direct access to this paper, since we will set up a local folder as its knowledge base.

RAG

We set up the RAG using langchain. Here, we use the DirectoryLoader, indicating that every PDF file should be included (recursively). Download the paper mentioned above and place it in this folder: res/documents.

For example:

import os, sys
import typing as ty

from langchain_core.documents import Document
from langchain_community.document_loaders.directory import DirectoryLoader

loader = DirectoryLoader(
    path=os.path.join('res', 'documents'),
    glob="*.pdf",
    recursive=True,
)

Then, we can load the documents:

docs: ty.List[Document] = loader.load()
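
As a quick sanity check (a minimal sketch, not part of the original pipeline), you can verify that the PDF was actually picked up, since DirectoryLoader will happily return an empty list if the path or glob is wrong:

# Make sure at least one document was loaded from res/documents.
print(f"Loaded {len(docs)} document(s)")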

Now we get ready to create our vector database, which is at the core of RAG. We set up a text splitter with a specified chunk size and overlap.

The chunk size indicates how much text (measured in characters for this splitter, by default) each chunk, and therefore each embedded vector, should contain; the overlap indicates how much text two consecutive chunks share.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
chunked_docs = splitter.split_documents(docs)
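
Again as an optional check (a small sketch), you can see how many chunks the splitter produced from the loaded documents:

# Each chunk will become one vector in the database.
print(f"Split {len(docs)} document(s) into {len(chunked_docs)} chunks")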

Next, we select the embedding model used to create the vectors, and from it we build our vector database.

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS

# For all model names, see: https://www.sbert.net/docs/pretrained_models.html
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
db = FAISS.from_documents(chunked_docs, embedding=embedding)
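
Optionally, you can persist the FAISS index to disk so it does not have to be rebuilt on every run. A minimal sketch (the res/faiss_index path is just an example, and the exact flags depend on your langchain_community version):

# Save the index to disk (the folder name is arbitrary).
db.save_local(os.path.join('res', 'faiss_index'))

# Reload it later with the same embedding model. Recent langchain_community
# versions require allow_dangerous_deserialization=True because the docstore
# is pickled.
db = FAISS.load_local(
    os.path.join('res', 'faiss_index'),
    embeddings=embedding,
    allow_dangerous_deserialization=True,
)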

The retriever is also a very important part of a RAG application. Given the user’s question, the retriever is responsible for finding, in the vector database, the vectors most relevant to serve as context for the LLM before it generates its answer. Since LLMs have a finite context window and a limited number of tokens they can handle, it is useful to be able to access a possibly huge database and select only what’s important.

Here we choose search_type="similarity", i.e. a plain vector-similarity search over the embeddings; more complex retrieval strategies can be chosen or developed.

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
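
Before wiring the retriever into a chain, it can help to inspect what it actually returns for a query (a minimal sketch; the query string is just an example based on the paper's topic, and depending on your langchain version you may need get_relevant_documents() instead of invoke()):

# Retrieve the top-k chunks for an example query and preview them.
retrieved_docs = retriever.invoke("open-world assumption in service-oriented systems")
for doc in retrieved_docs:
    print(doc.page_content[:200], "...")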

LLM

Here is the LLM part.

We select an open-source LLM from HuggingFace. To download it, note that you need the HUGGINGFACE_TOKEN environment variable to be set.

The most interesting bit here is the use of BitsAndBytesConfig, which quantizes the model. These LLMs can be huge and may crash your machine, so this can be useful. It only works on CUDA, though.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

if torch.cuda.is_available():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        # llm_int8_enable_fp32_cpu_offload=True,
    )
else:
    bnb_config = None

print(f"Using config: {bnb_config}")

model_name = "HuggingFaceH4/zephyr-7b-beta"
# model_name = "mistralai/Mistral-7B-v0.1"
# model_name = "Writer/palmyra-small" # Very small model, not sure this works well

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    token=os.environ['HUGGINGFACE_TOKEN'],
)

tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ['HUGGINGFACE_TOKEN'])
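
If you want to see what the quantization buys you, here is a small optional check (a sketch using transformers' get_memory_footprint, which reports bytes):

# Rough size of the (possibly 4-bit quantized) model in memory.
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")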

Pipeline

Here is the standard HuggingFace text-generation pipeline, created by passing our model (the LLM) together with its tokenizer.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
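
Before building any chains, you can give the wrapped pipeline a quick smoke test (a minimal sketch; the prompt is just an example):

# HuggingFacePipeline is a LangChain runnable, so invoke() works directly.
print(llm.invoke("What is Retrieval-Augmented Generation, in one sentence?"))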

LLM without RAG

We create our LLM chain, consisting of the prompt + the LLM (and a StrOutputParser for convenience).

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()

LLM with RAG

Here, we chain our RAG retriever in front of the LLM chain built above.

from langchain_core.runnables import RunnablePassthrough

# We reuse the retriever configured above.

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain

Question

This will be our question:

question = "Deep Reinforcement Learning (Deep RL) is increasingly used to cope with the open-world assumption in service-oriented systems. Is this true?"

The sentence “Deep Reinforcement Learning (Deep RL) is increasingly used to cope with the open-world assumption in service-oriented systems” is the first sentence of the abstract of the paper the RAG is retrieving context from.

Of course, the quality of the answers will also depend on the LLM you choose.

Answers

Without RAG:

llm_chain.invoke({"context": "", "question": question})

With RAG:

rag_chain.invoke(question)

Here they are:

model:
  name: HuggingFaceH4/zephyr-7b-beta

  question: "Deep Reinforcement Learning (Deep RL) is increasingly used to cope with the open-world assumption in service-oriented systems. Is this true?"

  answer-wo-rag: "No, Deep Reinforcement Learning (Deep RL) is not necessarily increasing in use to cope with the open-world assumption in service-oriented systems. While Deep RL can be a powerful tool for learning and decision-making in dynamic and uncertain environments, it may not be the best approach for all types of service-oriented systems. The open-world assumption refers to situations where new, unseen events or inputs can occur, which requires the system to adapt and learn on the fly. However, the specific requirements and constraints of each service-oriented system will determine whether Deep RL is the most appropriate solution. Therefore, while Deep RL has shown promise in some applications, its usage in service-oriented systems should be evaluated on a case-by-case basis."

  answer-w-rag: "Yes, according to the provided context, Deep Reinforcement Learning (Deep RL) is increasingly being used to address the open-world assumption in service-oriented systems. This information is mentioned in the abstract and also referenced in the section discussing related work. The authors note that Deep RL has been successfully applied to problems such as dynamic service composition, job scheduling, and offloading, as well as service adaptation, which further supports the claim that Deep RL is gaining popularity in this context."

As you can see, with RAG, the same model was confident in replying “Yes”, as the abstract of the paper that is in the RAG database says so.
