[10' read]

00 — Concepts: a visual tour of the Mistral Search Toolkit

RAG

This notebook is the on-ramp to all the other cookbooks. You won't deploy anything here; instead, you will build the mental model that every later cookbook assumes. Feel free to skip it if you already know RAG from other frameworks.

What you'll walk away with

By the end of this notebook you should be able to answer, in your own words:

What problem does RAG solve, and where does retrieval fit in the picture?
What are documents, chunks, embeddings, and a vector store?
What is hybrid search, and what does RRF combine?
Why does the toolkit expose ingestion as a composable Pipeline of swappable parts?
How does the QueryEngine orchestrate retrieval at query time?

Setup

The two code cells in this notebook (embeddings demo + RRF demo) only need this cookbook workspace installed:

cd search/cookbooks
uv sync

The embeddings demo also needs a MISTRAL_API_KEY in your environment (a .env file at the cookbook root is loaded automatically).

1. Why RAG? The problem retrieval solves

Large language models are great at reasoning over text, but they have two persistent weaknesses:

They don't know about your documents (internal documents, contracts, scientific PDFs, customer tickets…).
Their training data has a cutoff date, so anything newer (including real-time data) has to be supplied to them at query time.

Retrieval-Augmented Generation (RAG) addresses both weaknesses with two AI components:

A retriever finds the few passages from your corpus that are most relevant to a user's question.
An LLM is given those passages as context and asked to answer using them.

Because the LLM now has grounded evidence to draw on, it is much less likely to hallucinate, and you can keep its knowledge up to date simply by re-ingesting documents or pulling them from live sources.
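The hand-off between the two components can be sketched as plain string assembly. This is a minimal illustration, not the toolkit's API: `build_grounded_prompt`, the prompt wording, and the sample passages are all hypothetical.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that asks the model to answer from the passages only.

    Hypothetical helper for illustration; the wording is an assumption.
    """
    # Number the passages so the model (and the reader) can cite them.
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


# In a real system these passages would come from the retriever.
passages = [
    "Invoices are payable within 30 days of receipt.",
    "Late payments incur a 2% monthly fee.",
]
prompt = build_grounded_prompt("When are invoices due?", passages)
print(prompt)
```

The retriever decides what goes into `passages`; the LLM only ever sees the assembled prompt, which is why retrieval quality bounds answer quality.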

2. Two pipelines, one shared store

Before you can search documents, you need to index them in the vector store. That gives two pipelines, which meet in the middle at the vector store.

Pipeline	When it runs	Input	Output
Ingestion	Offline / batch, when documents change	Raw files (PDFs, etc.)	Indexed chunks in Vespa
Search	Online, on every user query	A query string	A ranked list of chunks

Ingestion is the slow, expensive, write-heavy half (OCR, embedding, indexing). Search is the fast, read-heavy half (embed the query, hybrid search, return top k). They share nothing at runtime except the vector store.

The toolkit reflects this split:

Ingestion is built around the Pipeline class (loader → extractor → splitter → embedder → store).
Search is built around the QueryEngine class (retriever(s) → optional rewriter/reranker).

3. Ingestion at a glance — the `Pipeline`

Every ingestion script in this repo looks the same. Here it is, taken directly from 01-quickstart/ingest.py:

pipeline = Pipeline(
    loader=FilesystemFileLoader(),
    extractor=MistralOCRExtractor(client=mistral_client),
    text_splitter=MarkdownTextSplitter(MarkdownTextSplitterConfig(chunk_size=5048, chunk_overlap=50)),
    embedder=MistralEmbedder(client=mistral_client),
    stores=vector_store,
)

await pipeline.run(documents=[Path("my.pdf")], use_checkpoint=False)

The Pipeline is just an orchestrator that wires together five swappable slots. Each slot has one job:

Slot	Job	Example from `01-quickstart`
`loader`	Where do the files live? Fetch the raw file.	`FilesystemFileLoader`
`extractor`	Turn files into clean text.	`MistralOCRExtractor` (OCR-aware, returns a `Document` of markdown pages)
`text_splitter`	Chop the text into chunks small enough to embed and precise enough to retrieve.	`MarkdownTextSplitter` (respects markdown structure, 50-char overlap between chunks)
`embedder`	Turn each chunk into a fixed-length vector.	`MistralEmbedder` (via the Mistral embeddings API)
`stores`	Where to write the indexed chunks.	`VespaSearchIndex`

Every later cookbook is a small variation on this shape:

02-advanced-indexing/ingest.py adds a checkpoint directory and a progress callback.
ingest_from_s3.py swaps FilesystemFileLoader for FileLoader(S3BlobStorage(...)).
ingest_with_metadata.py passes an extra chunk_enrichers=[CustomMetadataEnricher()] argument.
ingest_with_summaries.py passes chunk_enrichers=[SummaryEnricher(...)].
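To make the `chunk_size` / `chunk_overlap` idea from the `text_splitter` slot concrete, here is a minimal character-window sketch. It is not how `MarkdownTextSplitter` works (that splitter respects markdown structure); `split_with_overlap` is a hypothetical helper that only shows what "overlap between chunks" means.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters.

    Hypothetical illustration; a real splitter would respect document structure.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


windows = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(windows)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.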

4. Embeddings — turning text into geometry

An embedding is a fixed-length list of floats (generally 1024 numbers) that represents a chunk's meaning as a point in high-dimensional space. The whole magic of vector search rests on one property:

Texts with similar meaning land close together in this space, and they do so based on the context the words appear in, not the words themselves.

That single property does two things at once:

Two passages with different words but the same meaning end up close together — e.g. "a ball python coiled around a tree branch" and "the constrictor snake wrapped itself around the perch" share almost no vocabulary but are near-neighbours in embedding space. This is what lets you find relevant chunks even when the user's phrasing doesn't match the indexed text.
Two passages with the same word but different meanings end up far apart — "a ball python coiled around a branch" and "a Python script that scrapes an API" share the surface token python but live in completely different regions of the space, because the surrounding context tells the model these are unrelated concepts.

The same embedding model is used at ingestion time (to embed every chunk before storing) and at query time (to embed the user's query). That symmetry is what makes the geometry comparable on both sides.

The next code cell calls the Mistral embedding API directly on a small corpus that mixes the two senses of the word "python", then issues two queries that pull out two different subsets of the same corpus.

"""Same word, two meanings: what 'context' really means for embeddings.

We call the Mistral embedding API directly (no toolkit wrapper) on a small
corpus where the token ``python`` appears in every entry, but with two
completely different meanings: half the sentences are about the snake, the
other half about the programming language. We then issue two queries (one
biological, one software-engineering) and rank the corpus by cosine
similarity to each. The two queries pull out *different* subsets of the
corpus, even though every entry contains the token ``python``.
"""

import os

import numpy as np
from dotenv import load_dotenv
from mistralai.client import Mistral

load_dotenv()

client = Mistral(
    api_key=os.environ["MISTRAL_API_KEY"],
    server_url=os.getenv("MISTRAL_API_URL"),
)


def get_text_embedding(text: str) -> list[float]:
    response = client.embeddings.create(model="mistral-embed", inputs=text)
    return response.data[0].embedding


corpus = [
    # python the snake
    "Pythons are large, non-venomous constrictor snakes native to tropical Africa and Southeast Asia.",
    "A ball python coiled itself around a low branch deep in the rainforest.",
    "Reticulated pythons can grow over six metres long and are among the largest reptiles on Earth.",
    # python the programming language
    "Python is one of the most popular programming languages for data science and machine learning.",
    "We wrote a small Python script that scrapes the API and dumps the results to a CSV file.",
    "Django and FastAPI are mature Python web frameworks used in production by major companies.",
]
corpus_labels = ["snake"] * 3 + ["language"] * 3

queries = ["snakes in the tropical rainforest", "scripting language for data analysis"]

corpus_vectors = np.array([get_text_embedding(text) for text in corpus])
query_vectors = np.array([get_text_embedding(q) for q in queries])


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


for query, query_vec in zip(queries, query_vectors):
    print(f"Query: {query!r}")
    # Score every corpus entry once, then list them by descending similarity.
    scored = [
        (cosine(query_vec, vec), label, text)
        for text, label, vec in zip(corpus, corpus_labels, corpus_vectors)
    ]
    for score, label, text in sorted(scored, key=lambda item: -item[0]):
        print(f"  {score:+.3f}   [{label:<8}] {text}")
    print()

What just happened. Every line in the corpus contains the word python, so that token alone cannot separate the two senses: a keyword retriever such as BM25 scores literal term matches and has no notion of which python is meant. The embedding model, by contrast, encodes the surrounding context (coiled around a branch, tropical rainforest, Django, script, CSV file), and the two queries land in two different neighbourhoods of vector space:

"snakes in the tropical rainforest" pulls the three reptile sentences to the top.
"scripting language for data analysis" pulls the three programming sentences to the top.

That's the property the rest of the toolkit relies on: at retrieval time, embeddings turn "find passages about X" into "find the nearest vectors to the embedding of X", where "about X" includes context that the literal words don't capture.
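The "find the nearest vectors" step can be sketched without any API call. The snippet below is a brute-force (exact) search over hand-made 2-D vectors, purely for illustration; a real store uses an approximate index over much longer vectors, and `nearest_neighbours` is a hypothetical helper name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(
    query_vec: list[float], corpus_vecs: list[list[float]], top_k: int
) -> list[tuple[int, float]]:
    """Return (corpus index, similarity) for the top_k most similar vectors."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings.
toy_corpus = [
    [1.0, 0.0],  # index 0: points along the first axis
    [0.8, 0.6],  # index 1: somewhere in between
    [0.0, 1.0],  # index 2: points along the second axis
]
toy_query = [1.0, 0.1]

hits = nearest_neighbours(toy_query, toy_corpus, top_k=2)
for index, score in hits:
    print(f"{index}: {score:+.3f}")
```

The query points almost along the first axis, so index 0 ranks first and index 1 second; index 2 is nearly orthogonal and falls outside the top 2.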

5. The vector store and hybrid search

The vector store is the database that holds every chunk together with its embedding and answers retrieval queries. These cookbooks use Vespa, which is interesting precisely because it isn't just a vector database. For each chunk it stores:

the chunk's text, indexed for BM25 keyword search (a classical ranking function from the tf-idf family),
the chunk's embedding vector, indexed for approximate nearest-neighbour (ANN) search,
the chunk's metadata, used for filtering at query time (filename, page_number, custom fields you attach via a ChunkEnricher).

This is what makes hybrid search possible in a single query: each chunk is matched both by keyword and by semantic similarity, and the two ranked lists are fused into one.
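As a mental picture of what one stored record holds, here is a toy in-memory sketch. It is not Vespa's schema or the toolkit's data model: `StoredChunk` and `filter_by_metadata` are hypothetical names, and the sample data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class StoredChunk:
    """Hypothetical record: the three things the store keeps per chunk."""

    text: str  # indexed for keyword search
    embedding: list[float]  # indexed for nearest-neighbour search
    metadata: dict[str, object] = field(default_factory=dict)  # used for filtering


def filter_by_metadata(chunks: list[StoredChunk], **conditions: object) -> list[StoredChunk]:
    """Keep only chunks whose metadata matches every key=value condition."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in conditions.items())
    ]


stored = [
    StoredChunk("Payment is due within 30 days.", [0.9, 0.1],
                {"filename": "contract.pdf", "page_number": 2}),
    StoredChunk("The warranty lasts two years.", [0.2, 0.8],
                {"filename": "contract.pdf", "page_number": 5}),
    StoredChunk("Reset the router to fix the issue.", [0.4, 0.6],
                {"filename": "ticket-001.txt", "page_number": 1}),
]

contract_chunks = filter_by_metadata(stored, filename="contract.pdf")
print([chunk.text for chunk in contract_chunks])
```

Filtering narrows the candidate set first; keyword and vector ranking then only need to order what survives the filter.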

Why bother with both?

Each signal fails differently:

BM25 wins when the user types an exact term that appears in only a few documents, such as product codes, error messages, named entities, or acronyms. Embeddings tend to blur such terms by mapping them near generic neighbours.
Vector search wins when the user paraphrases or asks at a higher level of abstraction — "treatments for cancer" matching a chunk that only contains "chemotherapy" and "radiotherapy" but never the word "cancer".
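To see why an exact, rare term is BM25's home turf, here is a minimal sketch of the Okapi BM25 formula with whitespace tokenisation. The parameter defaults (`k1=1.2`, `b=0.75`), the smoothed idf variant, and the sample documents are assumptions for illustration; Vespa's own implementation and tokenisation differ in detail.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive lowercase whitespace tokeniser (an assumption for this sketch)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a large idf; the +1 keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalised by document length.
            denom = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "error code e4012 raised when the disk is full",
    "the disk was replaced after a hardware fault",
    "general troubleshooting guide for storage errors",
]
print(bm25_scores("e4012", docs))
```

Only the first document contains the code `e4012`, so it is the only one with a non-zero score; an embedding model has no guarantee of treating such an opaque token as distinctive.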

You really want both. The question is: how do you combine two ranked lists with completely different score scales (BM25 produces unbounded positive numbers, cosine similarity is in [-1, 1])?

The answer is Reciprocal Rank Fusion (RRF): forget the raw scores, look only at the rank of each chunk in each list, and use a simple formula.

$$ \text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)} $$

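with k typically 60 and rank_r(d) the 1-based position of chunk d in retriever r's list; in the usual formulation, a chunk that is absent from a list contributes nothing for that retriever. A chunk gets a high score by being near the top of at least one of the lists. If it's near the top of both, even better.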
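The formula is short enough to implement directly. The sketch below is a plain-Python illustration of RRF over two made-up rankings, assuming 1-based ranks and k = 60; the function name and chunk ids are hypothetical, and this is not the toolkit's or Vespa's implementation.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of ids into one list sorted by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based: the first item in each list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["c1", "c2", "c3"]  # keyword results, best first
ann_ranking = ["c3", "c1", "c4"]   # vector results, best first

fused = reciprocal_rank_fusion([bm25_ranking, ann_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
# c1: 0.03252
# c3: 0.03227
# c2: 0.01613
# c4: 0.01587
```

`c1` (ranks 1 and 2) edges out `c3` (ranks 3 and 1) because 1/61 + 1/62 is slightly larger than 1/63 + 1/61, and both beat the chunks that appear in only one list. The raw BM25 and cosine scores never enter the computation.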

6. The search side — `QueryEngine` and retrievers

Once documents are indexed, the search side is much smaller. The toolkit exposes one orchestrator, QueryEngine, which takes a list of retrievers and (optionally) post-processing components like rewriters and rerankers.

The simplest possible setup — the one in 01-quickstart/search.py — is one retriever, no rewriter, no reranker:

query_engine = QueryEngine(
    retriever=[VectorRetriever(client=vector_store, embedder=embedder)],
)
result = await query_engine.search(query="What is the main topic?", top_k=5)

VectorRetriever issues a hybrid query (BM25 + ANN + RRF), not a pure vector query. The name is historical; treat it as "the standard retriever".

03-advanced-search enriches this query-time pipeline with three classes you'll meet there:

LLMQueryRewriter — rephrases the user's question before retrieval.
LLMQueryExtension — generates several sub-queries and merges their results.
LLMReRanker — re-orders the top-k candidates by asking an LLM to score them.

7. Glossary — the words you'll see everywhere

Term	One-line definition	First cookbook that uses it
`Document`	The toolkit's top-level object after extraction; carries metadata and a list of `pages`.	01-quickstart
`DocumentChunk`	A small slice of a `Document` produced by the splitter; the unit of retrieval.	01-quickstart
Embedding	A fixed-length float vector representing the meaning of a chunk or query.	01-quickstart
BM25	Classical keyword-based relevance score; great for exact terms.	01-quickstart (under hybrid)
ANN	Approximate Nearest-Neighbour search over embeddings; the "vector" half of hybrid.	01-quickstart (under hybrid)
Hybrid search	A single query that combines BM25 + ANN via RRF.	01-quickstart
RRF	Reciprocal Rank Fusion; the formula that fuses two ranked lists into one.	01-quickstart (under hybrid)
`Pipeline`	The ingestion orchestrator: loader → extractor → splitter → embedder → store.	01-quickstart
`QueryEngine`	The search orchestrator: retrievers (+ optional rewriter/reranker).	01-quickstart
`ChunkEnricher`	A hook that attaches custom metadata to each chunk before embedding.	02-advanced-indexing
Checkpointing	Recording which documents are already indexed so a crash doesn't re-ingest them.	02-advanced-indexing
`LLMQueryRewriter`	Rephrases the user's question with an LLM before retrieval.	03-advanced-search
`LLMQueryExtension`	Decomposes one query into N sub-queries and merges their results.	03-advanced-search
`LLMReRanker`	Re-orders top-k results by asking an LLM to score them.	03-advanced-search
Match phase / Ranking phase	Vespa's two-stage retrieval: candidate selection vs. ordering.	04-evaluation
Precision@k / Recall@k / F1@k	Standard retrieval-quality metrics, computed against a ground-truth dataset.	04-evaluation
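
The last glossary row can be made concrete with a few lines of Python. This is a sketch under one common convention (precision divides by k, recall by the number of relevant chunks); the function name and the chunk ids are hypothetical, and 04-evaluation may define the metrics slightly differently.

```python
def precision_recall_f1_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float, float]:
    """Compute Precision@k, Recall@k and F1@k for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


retrieved = ["c1", "c7", "c3", "c9", "c2"]  # ranked output of a retriever
relevant = {"c1", "c2", "c3", "c4"}         # ground truth for this query

p, r, f1 = precision_recall_f1_at_k(retrieved, relevant, k=3)
print(f"P@3={p:.3f}  R@3={r:.3f}  F1@3={f1:.3f}")
# P@3=0.667  R@3=0.500  F1@3=0.571
```

Two of the top three results are relevant (precision 2/3), and they cover two of the four relevant chunks (recall 1/2).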

8. Where to go next

You now have the mental model. Run the next cookbooks in order.