Search Toolkit

Search Toolkit is a Python framework for building production-ready Information Retrieval (IR) systems. It provides plug-and-play, extensible components for ingestion, retrieval, and evaluation. The design is backend-agnostic, and every component is swappable.

LLMs are not trained on your private data. To ground their answers in your documents you need a retrieval pipeline that ingests, indexes, and searches your content. Search Toolkit gives you every building block to assemble that pipeline.

Key features

Ingestion

Multi-format extraction: PDF/DOCX/PPTX via Mistral OCR, HTML, spreadsheets, emails, plain text
File loading: Load from the local filesystem or implement custom loaders for any source
Flexible chunking: Character, token, markdown-aware, or separator-based splitting
Enrichment: Enrich documents and chunks with custom metadata or LLM-generated summaries
Indexing: Index to vector stores for semantic search
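
To make the chunking strategies above concrete, here is a minimal sketch of character-based and separator-based splitting in plain Python. The function names and the `overlap` parameter are assumptions made for illustration; they are not Search Toolkit's API, and the built-in splitters may behave differently.

```python
def split_by_characters(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text into fixed-size character windows (hypothetical helper).

    Consecutive windows share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def split_by_separator(text: str, separator: str = "\n\n") -> list[str]:
    """Split on a separator and drop empty pieces (hypothetical helper)."""
    return [piece.strip() for piece in text.split(separator) if piece.strip()]
```

Character windows are predictable in size, while separator-based splitting follows the document's own boundaries at the cost of uneven chunk lengths.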

Retrieval

Search strategy: Vector (semantic) search with optional reranking
Query preprocessing: Improve user queries with LLM reformulation or query extension
Reranking: LLM reranker, cross-encoder reranker, or custom rerankers
Semantic caching: Cache results by query similarity to skip redundant retrieval

Architecture

Ingestion transforms raw documents into searchable chunks. A FileLoader reads raw bytes from a source, a DocumentExtractor converts them into a structured document, a TextSplitter divides it into chunks, an optional ChunkEnricher adds metadata, and an Embedder produces vectors for indexing into a vector store.
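
The data flow described above can be sketched with plain functions, one per stage. Everything here is a toy stand-in: the function names, the `Chunk` dataclass, the in-memory `dict` source, and the two-number "embedding" are assumptions made for illustration and do not reflect the toolkit's actual classes or signatures.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict[str, str] = field(default_factory=dict)
    vector: list[float] = field(default_factory=list)


def load(source: dict[str, bytes], name: str) -> bytes:
    """FileLoader role: read raw bytes from a source (here, a dict)."""
    return source[name]


def extract(raw: bytes) -> str:
    """DocumentExtractor role: turn bytes into document text."""
    return raw.decode("utf-8")


def split(document: str) -> list[Chunk]:
    """TextSplitter role: one chunk per non-empty paragraph."""
    return [Chunk(text=p.strip()) for p in document.split("\n\n") if p.strip()]


def enrich(chunks: list[Chunk], source_name: str) -> list[Chunk]:
    """ChunkEnricher role: attach metadata to every chunk."""
    for position, chunk in enumerate(chunks):
        chunk.metadata.update({"source": source_name, "position": str(position)})
    return chunks


def embed(chunks: list[Chunk]) -> list[Chunk]:
    """Embedder role: a fake two-dimensional vector (characters, words)."""
    for chunk in chunks:
        chunk.vector = [float(len(chunk.text)), float(len(chunk.text.split()))]
    return chunks


def ingest(source: dict[str, bytes], name: str, index: list[Chunk]) -> int:
    """Run all stages in order and append the chunks to an in-memory index."""
    chunks = embed(enrich(split(extract(load(source, name))), name))
    index.extend(chunks)
    return len(chunks)
```

For example, ingesting a two-paragraph text file appends two chunks to `index`, each carrying its source name, position, and vector.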

Retrieval finds relevant chunks for a given query. An optional query preprocessor rewrites or expands the query, a Retriever searches the vector index, and an optional Reranker re-scores the results before returning them.

Both workflows are orchestrated by high-level classes (Pipeline for ingestion, QueryEngine for retrieval) that handle component wiring and execution. Every component is swappable — use the built-in implementations or bring your own.
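
One common way to make components swappable in Python is structural typing: the orchestrator depends on an interface, and any object with the right method fits. The sketch below illustrates that pattern only; `Splitter`, `ToyPipeline`, and both splitter classes are hypothetical and are not the toolkit's `Pipeline` or its component base classes.

```python
from typing import Protocol


class Splitter(Protocol):
    """Anything with a matching split() method satisfies this interface."""

    def split(self, text: str) -> list[str]: ...


class ParagraphSplitter:
    def split(self, text: str) -> list[str]:
        return [p for p in text.split("\n\n") if p]


class SentenceSplitter:
    def split(self, text: str) -> list[str]:
        # Naive sentence split on periods; re-append the period to each piece.
        return [s.strip() + "." for s in text.split(".") if s.strip()]


class ToyPipeline:
    """Orchestrator that only knows the Splitter interface, not a concrete class."""

    def __init__(self, splitter: Splitter):
        self.splitter = splitter

    def run(self, text: str) -> list[str]:
        return self.splitter.split(text)
```

Swapping behavior is then a one-argument change: `ToyPipeline(ParagraphSplitter())` versus `ToyPipeline(SentenceSplitter())`, with no change to the orchestrator itself.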

Components

Component	Built-in options
File Loaders	`FilesystemFileLoader`, custom loaders
Extractors	`MistralOCRExtractor`, `PlainTextExtractor`, `HTMLExtractor`, `SpreadsheetExtractor`, `EmailExtractor`, `NumbersExtractor`, `LegacyOfficeExtractor`
Text Splitters	`CharacterTextSplitter`, `TokenTextSplitter`, `MarkdownTextSplitter`, `SeparatorTextSplitter`
Enrichers	`SummaryEnricher`, custom `ChunkEnricher`
Embedders	`MistralEmbedder`, custom `Embedder`
Storage	Vespa or custom vector store
Retrievers	`VectorRetriever`
Rerankers	`LLMReRanker`, `CrossEncoderReRanker`, `RRFRanker`
Preprocessing	`LLMQueryRewriter`, `LLMQueryExtension`
Caching	`SemanticCache` with `InMemoryCacheBackend`

Installation & Extras

Install the core package:

uv add mistralai-search-toolkit

Optional extras add specialized functionality:

Extra	Description
`vespa`	Vespa plugin for vector storage and semantic search
`extractor-pymupdf`	Advanced PDF extraction with PyMuPDF Pro
`extractor-spreadsheet`	Spreadsheet parsing (Excel, CSV, Calamine format)
`extractor-email`	Email file parsing (EML, MSG formats)
`html-converter-markdownify`	Convert HTML to Markdown
`text-splitter-langchain`	Additional text splitting strategies via LangChain
`storage-gcs`	Google Cloud Storage integration
`storage-azure`	Azure Blob Storage integration
`all`	All optional extras

Install extras with the core package:

uv add "mistralai-search-toolkit[vespa]"
uv add "mistralai-search-toolkit[vespa,extractor-pymupdf]"
uv add "mistralai-search-toolkit[all]"  # Install all extras

All packages are available on PyPI.

Requires Python 3.12+. We recommend using uv for dependency management.

Next steps

Quickstart — build your first ingestion and retrieval pipeline end to end.
Search index — set up your vector store.
Ingestion — load, extract, chunk, enrich, and index your documents.
Retrieval — configure vector search with optional reranking.