Conversational RAG agent on M-series Macbooks
In a previous post, we have covered how to create a conversational RAG agent. In that post, we used langchain_huggingface and focused a general use-case.
In this post, we will switch to llama_index for the orchestration part (so that we can show how to use different tools) and we will focus on M-series Macbooks (M1, M2, etc.). llama_index has actually grown on me a lot recently. My opinion is that langchain is too rigid and lacks support and documentation for edge-cases. Never try to combine together two features that they support individually, it’s going to be a little hell sometimes (or most of the times). llama_index feels more like ‘it just works’.
Anyway, why another post on LLM agents and why focus on M-series Macbooks? Well, I own one and was left very frustrated by the millions of tutorials and libraries that assume you have access to some super-powerful NVIDIA graphic card on Linux (I am also talking about you, bitsandbytes).
Luckily, things have changed since I started ranting about this issue for the first time. Solutions now exist, with models already quantized and ready to download for M-series Macbooks, but these solutions may still perhaps be hard to find, or scattered over several places.
In this post, thus, I will collect a simple approach to get a quantized powerfull LLM to run on a Macbook M1 (which is the laptop I am writing this post from).
We will set up an agentic LLM with a RAG (via tooling) to answer questions about Uber and Lyft. This post is heavily based on Llamaindex official documentation, which however uses a LLM behind paywall, which I find blasphemous. Here, we’ll use an open-source LLM and we will also add support for M-series Macbooks via mlx.
By the end of this post, if you own a M-series Macbook, you’ll be able to build an agentic LLM that can answer questions related on any private documents you might have. Got some long legal boring documents you’d have to read all over again just because you don’t remember a small detail? No problem, set up a locall LLM with access to that document and ask questions about it. No local and private data ever leaves your laptop.
Let’s go.
Using
llama-index==0.12.11on Python3.12.7.
Download the data
Run this in a terminal and download the documents we’ll be using in our RAG:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
Now the data we’ll need is on your local machine.
Set up
At this point, select an embedder for the vector store:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Load Embedding Model
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Import relevant libraries and set up the vector store:
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core.tools import QueryEngineTool, ToolMetadata
# load data
lyft_docs = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["./data/10k/uber_2021.pdf"]).load_data()
# build index
lyft_index = VectorStoreIndex.from_documents(lyft_docs, embed_model=embed_model)
uber_index = VectorStoreIndex.from_documents(uber_docs, embed_model=embed_model)
# persist index
lyft_index.storage_context.persist(persist_dir="./storage/lyft")
uber_index.storage_context.persist(persist_dir="./storage/uber")
The folder ./storage has been created now.
LLM
Now choose your favorite LLM from the MLX Community page. They have a lot of LLM ready for use with mlx_lm, but not only: some are already quantized! So you do not need to download a huge model that barely fits into your memory and then quantize it yourself. You directly download the quantized version, and save both memory and bandwith:
from llama_index.llms.mlx import MLXLLM
from mlx_lm import load, generate
# Load the MLX model
model_name = "mlx-community/DeepSeek-R1-Distill-Llama-8B-4bit"
model, tokenizer = load(model_name)
# Create the LLM wrapper for LlamaIndex
llm = MLXLLM(
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
)
Agent’s tools
Let’s create the LLM Agent’s tools, which will let the LLM access the documents:
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3, llm=llm)
uber_engine = uber_index.as_query_engine(similarity_top_k=3, llm=llm)
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description=(
                "Provides information about Lyft financials for year 2021. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name="uber_10k",
            description=(
                "Provides information about Uber financials for year 2021. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    ),
]
Agent
You can build your agent directly from the tools that we just created:
from llama_index.core.agent import ReActAgent
context = """
You are a stock market sorcerer who is an expert on the companies Lyft and Uber. You will answer questions about Uber and Lyft as in the persona of a sorcerer and veteran stock market investor.
"""
agent = ReActAgent.from_tools(query_engine_tools, llm=llm, verbose=True, context=context)
Now the agent is ready to be questioned.
Chat
Now, you’re ready to ask questions:
response = agent.chat("What was Lyft's revenue growth in 2021?")
This is the response I got. The LLM first started to “think” and ask for tools:
> Running step d7503ed8-7e2f-43d8-9016-47ffdb6d34b2. Step input: What was Lyft's revenue growth in 2021?
Thought: The user is asking about the revenue growth of Lyft in 2021 compared to Uber. I need to use the lyft_10k tool to get detailed financial information.
Action: lyft_10k
Action Input: {'input': "What was Lyft's revenue growth in 2021 compared to Uber?"}
Observation: 36% compared to Uber's 2021 revenue growth.
Wait, but the context says "revenue increasing 36% in 2021 compared to the prior year". It doesn't specify compared to whom. So, the 36% is compared to 2020. So, the answer should be that Lyft's revenue in 2021 increased 36% compared to 2020. But the query says "compared to Uber", which isn't mentioned. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't say compared to Uber. Alternatively, maybe the 36% is against Uber's 2020? But the context doesn't specify. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is 36% growth in 2021 compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says
> Running step 42080e66-7abb-4e27-9a9b-bbf4c9ea8a8b. Step input: None
Thought: (Implicit) I can answer without any more tools!
Answer: "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to
Then, it returned this response:
print(str(response))
"revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to 2020. But the query asks compared to Uber, which isn't provided. So, perhaps the answer is that Lyft's revenue in 2021 increased 36% compared to 2020, but without knowing Uber's growth, we can't answer the comparison to Uber. Alternatively, maybe the 36% is against Uber's 2021? But the context doesn't say that. It just says "revenue increasing 36% in 2021 compared to the prior year". So, the answer is that Lyft's revenue in 2021 increased 36% compared to
Not great, but that goes beyond the point of the post (perhaps increate max_new_tokens or play with temperature or try another LLM). Now that you can set this up on your M-series Macbook, nothing prevents you to play around and find what works for you!
