Search Toolkit
Search Toolkit is a Python framework for building production-ready information retrieval (IR) systems. It provides components for ingestion, retrieval, and evaluation. It works across backends, and every component can be swapped.
LLMs are not trained on your private data. To ground their answers in your documents you need a retrieval pipeline that ingests, indexes, and searches your content. Search Toolkit gives you every building block to assemble that pipeline.
Key features
Ingestion
Ingestion includes:
- Multi-format extraction: PDF/DOCX/PPTX via Mistral OCR, HTML, spreadsheets, emails, plain text
- File loading: Load from the local filesystem or implement custom loaders for any source
- Flexible chunking: Character, token, markdown-aware, or separator-based splitting
- Enrichment: Enrich documents and chunks with custom metadata or LLM-generated summaries
- Indexing: Index to vector stores for semantic search
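To make the chunking step concrete, here is a minimal sketch of character-based splitting with overlap. The function name `split_by_characters` and its parameters are hypothetical and are not the toolkit's API; the built-in splitters may behave differently.

```python
def split_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only, not part of Search Toolkit.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_by_characters("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.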
Retrieval
Retrieval includes:
- Multiple strategies: Vector (semantic) search with optional reranking
- Query preprocessing: Improve user queries with LLM reformulation or query extension
- Reranking: LLM reranker, cross-encoder reranker, or custom rerankers
- Semantic caching: Cache results by query similarity to skip redundant retrieval
Architecture
Ingestion transforms raw documents into searchable chunks. A FileLoader reads raw bytes from a source, a DocumentExtractor converts them into a structured document, a TextSplitter divides it into chunks, an optional ChunkEnricher adds metadata, and an Embedder produces vectors for indexing into a vector store.
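The ingestion stages described above can be sketched as a chain of small functions. Everything in this example is a stand-in: the function names, the `Chunk` dataclass, the in-memory "filesystem", and the toy bucket-count embedding are assumptions for illustration and do not reflect the toolkit's actual classes or signatures.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk record: text plus metadata and an embedding."""
    text: str
    metadata: dict = field(default_factory=dict)
    vector: list[float] = field(default_factory=list)


def load(source: dict[str, bytes], name: str) -> bytes:
    """Loader stage: read raw bytes (here from an in-memory dict)."""
    return source[name]


def extract(raw: bytes) -> str:
    """Extractor stage: turn raw bytes into text."""
    return raw.decode("utf-8")


def split(text: str) -> list[Chunk]:
    """Splitter stage: one chunk per non-empty paragraph."""
    return [Chunk(text=p.strip()) for p in text.split("\n\n") if p.strip()]


def enrich(chunk: Chunk, source_name: str) -> Chunk:
    """Enricher stage: attach simple metadata."""
    chunk.metadata["source"] = source_name
    chunk.metadata["n_words"] = len(chunk.text.split())
    return chunk


def embed(chunk: Chunk, dims: int = 8) -> Chunk:
    """Embedder stage: toy vector that counts words per bucket (not a real model)."""
    vector = [0.0] * dims
    for word in chunk.text.lower().split():
        vector[sum(map(ord, word)) % dims] += 1.0
    chunk.vector = vector
    return chunk


def ingest(source: dict[str, bytes], name: str, index: list[Chunk]) -> list[Chunk]:
    """Run every stage in order and append the results to a list-based index."""
    text = extract(load(source, name))
    for chunk in split(text):
        index.append(embed(enrich(chunk, name)))
    return index


files = {"notes.txt": b"First paragraph about search.\n\nSecond paragraph about ranking."}
index = ingest(files, "notes.txt", [])
```

Each stage only depends on the output of the previous one, which is what makes it possible to replace a single stage without touching the rest.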
Retrieval finds relevant chunks for a given query. An optional query preprocessor rewrites or expands the query, a Retriever searches the vector index, and an optional Reranker re-scores the results before returning them.
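As a rough illustration of the retrieval flow, the sketch below preprocesses a query, ranks indexed texts by cosine similarity, and then re-scores the candidates. The vocabulary-count embedding, the whitespace normalization standing in for an LLM rewrite, and the term-overlap reranker are all simplifying assumptions, not the toolkit's implementations.

```python
import math

VOCAB = ["search", "rerank", "cache", "index"]  # toy vocabulary for the example


def embed(text: str) -> list[float]:
    """Toy embedding: count how often each vocabulary term occurs."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def preprocess(query: str) -> str:
    """Stand-in for query rewriting: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


def retrieve(query_vector: list[float], index: list[tuple[str, list[float]]], top_k: int) -> list[str]:
    """Return the top_k texts whose vectors are most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def rerank(query: str, candidates: list[str]) -> list[str]:
    """Re-score candidates by how many query terms they share."""
    terms = set(query.split())
    return sorted(candidates, key=lambda text: len(terms & set(text.lower().split())), reverse=True)


def answer(query: str, index: list[tuple[str, list[float]]], top_k: int = 2) -> list[str]:
    cleaned = preprocess(query)
    candidates = retrieve(embed(cleaned), index, top_k)
    return rerank(cleaned, candidates)


docs = [
    "how to cache search results",
    "build a search index",
    "rerank search results with a cross encoder",
]
index = [(doc, embed(doc)) for doc in docs]
results = answer("  Search   INDEX ", index)
```

Retrieval narrows the corpus cheaply, and the reranker then spends more effort on the few candidates that remain.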
Both workflows are orchestrated by high-level classes (Pipeline for ingestion, QueryEngine for retrieval) that handle component wiring and execution. Every component can be swapped: use the built-in implementations or bring your own.
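One common way to make components swappable is to have the orchestrator depend on an interface rather than a concrete class. The sketch below uses `typing.Protocol` for that; `Splitter`, `SentenceSplitter`, `LineSplitter`, and `ToyPipeline` are hypothetical names, and the source does not say that the toolkit's Pipeline is built this way.

```python
from typing import Protocol


class Splitter(Protocol):
    """Hypothetical interface: anything with a matching split() method qualifies."""

    def split(self, text: str) -> list[str]: ...


class SentenceSplitter:
    def split(self, text: str) -> list[str]:
        # Split on periods and restore the period on each sentence.
        return [part.strip() + "." for part in text.split(".") if part.strip()]


class LineSplitter:
    def split(self, text: str) -> list[str]:
        return [line for line in text.splitlines() if line.strip()]


class ToyPipeline:
    """Orchestrator that only knows about the Splitter interface."""

    def __init__(self, splitter: Splitter) -> None:
        self.splitter = splitter

    def run(self, text: str) -> list[str]:
        return self.splitter.split(text)


text = "One. Two.\nThree."
by_sentence = ToyPipeline(SentenceSplitter()).run(text)
by_line = ToyPipeline(LineSplitter()).run(text)
```

Swapping the splitter changes the output without any change to the orchestrator itself.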
Components
| Component | Built-in options |
|---|---|
| File Loaders | FilesystemFileLoader, custom loaders |
| Extractors | MistralOCRExtractor, PlainTextExtractor, HTMLExtractor, SpreadsheetExtractor, EmailExtractor, NumbersExtractor, LegacyOfficeExtractor |
| Text Splitters | CharacterTextSplitter, TokenTextSplitter, MarkdownTextSplitter, SeparatorTextSplitter |
| Enrichers | SummaryEnricher, custom ChunkEnricher |
| Embedders | MistralEmbedder, custom Embedder |
| Storage | Vespa or custom vector store |
| Retrievers | VectorRetriever |
| Rerankers | LLMReRanker, CrossEncoderReRanker, RRFRanker |
| Preprocessing | LLMQueryRewriter, LLMQueryExtension |
| Caching | SemanticCache with InMemoryCacheBackend |
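The table lists an RRFRanker. Assuming the name refers to reciprocal rank fusion (the source does not confirm this), the technique can be sketched as below: each document receives 1 / (k + rank) from every ranked list it appears in, and the sums determine the fused order. The function name and the default k = 60 are choices made for this example, not the toolkit's API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Illustrative helper only; ranks start at 1, so the top document of a
    list contributes 1 / (k + 1).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
print(fused)  # ['b', 'a', 'c']
```

Because only ranks are used, lists produced by scorers with incomparable score scales can be combined without normalization.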
Installation & Extras
Install the core package:
uv add mistralai-search-toolkit
Optional extras add specialized functionality:
| Extra | Description |
|---|---|
| vespa | Vespa plugin for vector storage and semantic search |
| extractor-pymupdf | Advanced PDF extraction with PyMuPDF Pro |
| extractor-spreadsheet | Spreadsheet parsing (Excel, CSV, Calamine format) |
| extractor-email | Email file parsing (EML, MSG formats) |
| html-converter-markdownify | Convert HTML to Markdown |
| text-splitter-langchain | Additional text splitting strategies via LangChain |
| storage-gcs | Google Cloud Storage integration |
| storage-azure | Azure Blob Storage integration |
| all | All optional extras |
Install extras with the core package:
uv add "mistralai-search-toolkit[vespa]"
uv add "mistralai-search-toolkit[vespa,extractor-pymupdf]"
uv add "mistralai-search-toolkit[all]"  # Install all extras
All packages are available on PyPI.
Requires Python 3.12+. We recommend using uv for dependency management.
Next steps
- Quickstart: build your first ingestion and retrieval pipeline end to end.
- Search index: set up your vector store.
- Ingestion: load, extract, chunk, enrich, and index your documents.
- Retrieval: configure vector search with optional reranking.