Search Toolkit

Search Toolkit is a Python framework for building production-ready Information Retrieval (IR) systems. It provides plug-and-play, extensible components for ingestion, retrieval, and evaluation. The components are backend-agnostic, and each one can be swapped for another implementation.

LLMs are not trained on your private data. To ground their answers in your documents, you need a retrieval pipeline that ingests, indexes, and searches your content. Search Toolkit provides the building blocks to assemble that pipeline.

Key features

Ingestion

  • Multi-format extraction: PDF/DOCX/PPTX via Mistral OCR, HTML, spreadsheets, emails, plain text
  • File loading: Load from the local filesystem or implement custom loaders for any source
  • Flexible chunking: Character, token, markdown-aware, or separator-based splitting
  • Enrichment: Enrich documents and chunks with custom metadata or LLM-generated summaries
  • Indexing: Index to vector stores for semantic search
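
To make the chunking step concrete, here is a minimal sketch of character-based splitting. It does not use the toolkit's CharacterTextSplitter; the function name split_by_characters and the chunk_size and overlap parameters are assumptions made for illustration, and overlap between chunks is a common convention rather than something this page states.

```python
def split_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = split_by_characters("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text cut at a boundary still appears whole in at least one chunk when the overlap is large enough.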

Retrieval

  • Multiple strategies: Vector (semantic) search with optional reranking
  • Query preprocessing: Improve user queries with LLM reformulation or query extension
  • Reranking: LLM reranker, cross-encoder reranker, or custom rerankers
  • Semantic caching: Cache results by query similarity to skip redundant retrieval

Architecture

Ingestion transforms raw documents into searchable chunks. A FileLoader reads raw bytes from a source, a DocumentExtractor converts them into a structured document, a TextSplitter divides it into chunks, an optional ChunkEnricher adds metadata, and an Embedder produces vectors for indexing into a vector store.
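
The ingestion flow described above can be sketched with plain functions standing in for each role. None of these names are the toolkit's API: load, extract, split, enrich, embed, and ingest are hypothetical, the source is an in-memory dict, and the embedder is a toy hashed bag-of-words used only so the example runs without a model.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)
    vector: list[float] = field(default_factory=list)


def load(source: dict[str, bytes], name: str) -> bytes:
    # FileLoader role: return raw bytes from a source (here, an in-memory dict).
    return source[name]


def extract(raw: bytes) -> str:
    # DocumentExtractor role: turn raw bytes into text.
    return raw.decode("utf-8")


def split(text: str) -> list[Chunk]:
    # TextSplitter role: one chunk per non-empty paragraph.
    return [Chunk(p.strip()) for p in text.split("\n\n") if p.strip()]


def enrich(chunk: Chunk, name: str) -> Chunk:
    # ChunkEnricher role: attach metadata to the chunk.
    chunk.metadata.update({"source": name, "n_words": len(chunk.text.split())})
    return chunk


def embed(text: str, dim: int = 8) -> list[float]:
    # Embedder role: toy hashed bag-of-words, a placeholder for a real model.
    vector = [0.0] * dim
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vector


def ingest(source: dict[str, bytes], name: str) -> list[Chunk]:
    # Run the stages in order; the returned list stands in for a vector store.
    index: list[Chunk] = []
    for chunk in split(extract(load(source, name))):
        chunk = enrich(chunk, name)
        chunk.vector = embed(chunk.text)
        index.append(chunk)
    return index


files = {"notes.txt": b"Vector search finds similar text.\n\nRerankers re-score results."}
index = ingest(files, "notes.txt")
```

Because each stage only depends on the output of the previous one, any stage can be replaced without touching the others.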

Retrieval finds relevant chunks for a given query. An optional query preprocessor rewrites or expands the query, a Retriever searches the vector index, and an optional Reranker re-scores the results before returning them.
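
The retrieval flow can be sketched the same way. The functions below are hypothetical stand-ins, not the toolkit's Retriever or Reranker classes: the preprocessor only normalizes the query, the retriever ranks chunks by cosine similarity over a toy hashed embedding, and the reranker re-scores by exact word overlap.

```python
import math
import zlib


def embed(text: str, dim: int = 16) -> list[float]:
    # Toy hashed bag-of-words embedding, a placeholder for a real model.
    vector = [0.0] * dim
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vector


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def preprocess(query: str) -> str:
    # Query preprocessor role: collapse whitespace and drop trailing punctuation.
    return " ".join(query.split()).rstrip("?!.")


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    # Retriever role: rank chunks by vector similarity and keep the top_k.
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: list[str]) -> list[str]:
    # Reranker role: re-score the candidates by exact word overlap with the query.
    query_words = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )


def search(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    cleaned = preprocess(query)
    return rerank(cleaned, retrieve(cleaned, chunks, top_k))


chunks = [
    "vector search finds similar text",
    "rerankers re-score retrieved results",
    "chunking splits documents",
]
results = search("  vector   search? ", chunks, top_k=3)
```

The retriever is cheap and approximate while the reranker is more precise but only sees the shortlisted candidates, which is the usual reason for splitting the work into two stages.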

Both workflows are orchestrated by high-level classes (Pipeline for ingestion, QueryEngine for retrieval) that handle component wiring and execution. Every component is swappable — use the built-in implementations or bring your own.
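
One common way to make components swappable in Python is structural typing with typing.Protocol. The sketch below assumes nothing about how the toolkit defines its interfaces: the Reranker protocol, both implementations, and ToyQueryEngine are hypothetical and exist only to show the pattern.

```python
from typing import Protocol


class Reranker(Protocol):
    """Hypothetical interface: any object with this method can be plugged in."""

    def rerank(self, query: str, candidates: list[str]) -> list[str]: ...


class KeepOrder:
    # Built-in style implementation: leave the retrieval order untouched.
    def rerank(self, query: str, candidates: list[str]) -> list[str]:
        return list(candidates)


class ShortestFirst:
    # Custom implementation: prefer shorter candidates.
    def rerank(self, query: str, candidates: list[str]) -> list[str]:
        return sorted(candidates, key=len)


class ToyQueryEngine:
    """Wires a naive keyword retriever to whichever reranker it is given."""

    def __init__(self, corpus: list[str], reranker: Reranker) -> None:
        self.corpus = corpus
        self.reranker = reranker

    def search(self, query: str) -> list[str]:
        words = query.lower().split()
        hits = [doc for doc in self.corpus if any(w in doc.lower() for w in words)]
        return self.reranker.rerank(query, hits)


corpus = ["vector search with reranking", "vector index", "email parsing"]
default_results = ToyQueryEngine(corpus, KeepOrder()).search("vector")
custom_results = ToyQueryEngine(corpus, ShortestFirst()).search("vector")
```

Neither implementation inherits from Reranker; matching the method signature is enough, so a custom component needs no dependency on a base class.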

Components

Component        Built-in options
File Loaders     FilesystemFileLoader, custom loaders
Extractors       MistralOCRExtractor, PlainTextExtractor, HTMLExtractor, SpreadsheetExtractor, EmailExtractor, NumbersExtractor, LegacyOfficeExtractor
Text Splitters   CharacterTextSplitter, TokenTextSplitter, MarkdownTextSplitter, SeparatorTextSplitter
Enrichers        SummaryEnricher, custom ChunkEnricher
Embedders        MistralEmbedder, custom Embedder
Storage          Vespa or custom vector store
Retrievers       VectorRetriever
Rerankers        LLMReRanker, CrossEncoderReRanker, RRFRanker
Preprocessing    LLMQueryRewriter, LLMQueryExtension
Caching          SemanticCache with InMemoryCacheBackend
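
The component list names RRFRanker without describing it. Assuming RRF stands for reciprocal rank fusion, the technique itself can be sketched as below. This is not the toolkit's implementation: the function name is hypothetical, and k=60 is the constant conventionally used for this method, not a value taken from this page.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each list adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
print(fused)  # ['b', 'a', 'c']
```

Only ranks are used, never raw scores, so lists produced by different scoring methods can be combined without normalizing them first.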

Installation & Extras

Install the core package:

uv add mistralai-search-toolkit

Optional extras add specialized functionality:

Extra                        Description
vespa                        Vespa plugin for vector storage and semantic search
extractor-pymupdf            Advanced PDF extraction with PyMuPDF Pro
extractor-spreadsheet        Spreadsheet parsing (Excel, CSV, Calamine format)
extractor-email              Email file parsing (EML, MSG formats)
html-converter-markdownify   Convert HTML to Markdown
text-splitter-langchain      Additional text splitting strategies via LangChain
storage-gcs                  Google Cloud Storage integration
storage-azure                Azure Blob Storage integration
all                          All optional extras

Install extras with the core package:

uv add "mistralai-search-toolkit[vespa]"
uv add "mistralai-search-toolkit[vespa,extractor-pymupdf]"
uv add "mistralai-search-toolkit[all]"  # Install all extras

All packages are available on PyPI.

Requires Python 3.12+. We recommend using uv for dependency management.

Next steps

  • Quickstart — build your first ingestion and retrieval pipeline end to end.
  • Search index — set up your vector store.
  • Ingestion — load, extract, chunk, enrich, and index your documents.
  • Retrieval — configure vector search with optional reranking.