Search Toolkit
Search Toolkit is a Python framework for building production-ready Information Retrieval (IR) systems. It provides plug-and-play, extensible components for ingestion, retrieval, and evaluation. The components are backend-agnostic, and each one can be swapped for another implementation.
LLMs are not trained on your private data. To ground their answers in your documents you need a retrieval pipeline that ingests, indexes, and searches your content. Search Toolkit gives you every building block to assemble that pipeline.
Key features
Ingestion
- Multi-format extraction: PDF/DOCX/PPTX via Mistral OCR, HTML, spreadsheets, emails, plain text
- File loading: Load from the local filesystem or implement custom loaders for any source
- Flexible chunking: Character, token, markdown-aware, or separator-based splitting
- Enrichment: Enrich documents and chunks with custom metadata or LLM-generated summaries
- Indexing: Index to vector stores for semantic search
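
To make the chunking step concrete, here is a minimal sketch of character-based splitting with overlap, written in plain Python. It is not the toolkit's CharacterTextSplitter; the function name `split_by_characters` and its `chunk_size` and `overlap` parameters are hypothetical and chosen only for illustration.

```python
def split_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Hypothetical helper for illustration; not part of Search Toolkit.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_by_characters("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a boundary is still partly visible in both.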
Retrieval
- Multiple strategies: Vector (semantic) search with optional reranking
- Query preprocessing: Improve user queries with LLM reformulation or query extension
- Reranking: LLM reranker, cross-encoder reranker, or custom rerankers
- Semantic caching: Cache results by query similarity to skip redundant retrieval
Architecture
Ingestion transforms raw documents into searchable chunks. A FileLoader reads raw bytes from a source, a DocumentExtractor converts them into a structured document, a TextSplitter divides it into chunks, an optional ChunkEnricher adds metadata, and an Embedder produces vectors for indexing into a vector store.
Retrieval finds relevant chunks for a given query. An optional query preprocessor rewrites or expands the query, a Retriever searches the vector index, and an optional Reranker re-scores the results before returning them.
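
The retrieve-then-rerank flow described above can be illustrated with a toy in-memory index. This sketch does not use the toolkit's Retriever or Reranker classes; `retrieve`, `rerank_by_term_overlap`, and the two-dimensional vectors are hypothetical stand-ins, and term overlap is used here only as a simple substitute for an LLM or cross-encoder reranker.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float], index: list[tuple[str, list[float]]], top_k: int = 3) -> list[str]:
    """First stage: return the top_k chunk texts by vector similarity."""
    scored = [(cosine(query_vector, vector), text) for text, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def rerank_by_term_overlap(query: str, candidates: list[str]) -> list[str]:
    """Second stage: reorder candidates by how many query terms they contain."""
    query_terms = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(query_terms & set(text.lower().split()))

    # sorted() is stable, so ties keep their first-stage order.
    return sorted(candidates, key=overlap, reverse=True)


index = [
    ("cats purr", [1.0, 0.0]),
    ("dogs bark loudly", [0.0, 1.0]),
    ("cats and dogs", [0.7, 0.7]),
]
candidates = retrieve([0.9, 0.1], index, top_k=2)
print(candidates)  # ['cats purr', 'cats and dogs']
print(rerank_by_term_overlap("dogs and cats", candidates))  # ['cats and dogs', 'cats purr']
```

The first stage is cheap and recall-oriented; the second stage looks only at the shortlist, which is why a slower, more precise scorer is affordable there.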
Both workflows are orchestrated by high-level classes (Pipeline for ingestion, QueryEngine for retrieval) that handle component wiring and execution. Every component is swappable — use the built-in implementations or bring your own.
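
One common way to get this kind of swappability in Python is structural typing with `typing.Protocol`: the orchestrator depends on an interface, and any object with the right method can be plugged in. The sketch below shows the pattern only. `Splitter`, `ParagraphSplitter`, `SentenceSplitter`, and `ToyPipeline` are hypothetical names, and nothing here is taken from the toolkit's actual Pipeline or TextSplitter interfaces.

```python
from typing import Protocol


class Splitter(Protocol):
    """Hypothetical interface: anything with a matching split() method qualifies."""

    def split(self, text: str) -> list[str]: ...


class ParagraphSplitter:
    def split(self, text: str) -> list[str]:
        # Blank lines separate paragraphs.
        return [part.strip() for part in text.split("\n\n") if part.strip()]


class SentenceSplitter:
    def split(self, text: str) -> list[str]:
        # Naive sentence split on periods; good enough for a demonstration.
        return [part.strip() + "." for part in text.split(".") if part.strip()]


class ToyPipeline:
    """Hypothetical orchestrator that only knows about the Splitter interface."""

    def __init__(self, splitter: Splitter) -> None:
        self.splitter = splitter

    def run(self, text: str) -> list[str]:
        return self.splitter.split(text)


print(ToyPipeline(ParagraphSplitter()).run("First block.\n\nSecond block."))
# ['First block.', 'Second block.']
print(ToyPipeline(SentenceSplitter()).run("One. Two."))
# ['One.', 'Two.']
```

Because the orchestrator never names a concrete class, replacing a built-in component with a custom one does not require changing the orchestration code.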
Components
| Component | Built-in options |
|---|---|
| File Loaders | FilesystemFileLoader, custom loaders |
| Extractors | MistralOCRExtractor, PlainTextExtractor, HTMLExtractor, SpreadsheetExtractor, EmailExtractor, NumbersExtractor, LegacyOfficeExtractor |
| Text Splitters | CharacterTextSplitter, TokenTextSplitter, MarkdownTextSplitter, SeparatorTextSplitter |
| Enrichers | SummaryEnricher, custom ChunkEnricher |
| Embedders | MistralEmbedder, custom Embedder |
| Storage | Vespa or custom vector store |
| Retrievers | VectorRetriever |
| Rerankers | LLMReRanker, CrossEncoderReRanker, RRFRanker |
| Preprocessing | LLMQueryRewriter, LLMQueryExtension |
| Caching | SemanticCache with InMemoryCacheBackend |
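
Assuming the "RRF" in RRFRanker stands for reciprocal rank fusion (the table does not say so explicitly), the underlying technique can be sketched as follows. Each document receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed. The function name and the default `k = 60` are illustrative choices, not the toolkit's API or defaults.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list.

    Hypothetical helper for illustration; ranks are 1-based.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
print(fused)  # ['b', 'a', 'c']
```

Because only ranks are used, the lists being fused do not need comparable score scales.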
Installation & Extras
Install the core package:
uv add mistralai-search-toolkit

Optional extras add specialized functionality:
| Extra | Description |
|---|---|
| vespa | Vespa plugin for vector storage and semantic search |
| extractor-pymupdf | Advanced PDF extraction with PyMuPDF Pro |
| extractor-spreadsheet | Spreadsheet parsing (Excel, CSV, Calamine format) |
| extractor-email | Email file parsing (EML, MSG formats) |
| html-converter-markdownify | Convert HTML to Markdown |
| text-splitter-langchain | Additional text splitting strategies via LangChain |
| storage-gcs | Google Cloud Storage integration |
| storage-azure | Azure Blob Storage integration |
| all | All optional extras |
Install extras with the core package:
uv add "mistralai-search-toolkit[vespa]"
uv add "mistralai-search-toolkit[vespa,extractor-pymupdf]"
uv add "mistralai-search-toolkit[all]"  # Install all extras

All packages are available on PyPI.
Requires Python 3.12+. We recommend using uv for dependency management.
Next steps
- Quickstart — build your first ingestion and retrieval pipeline end to end.
- Search index — set up your vector store.
- Ingestion — load, extract, chunk, enrich, and index your documents.
- Retrieval — configure vector search with optional reranking.