Ingestion

Ingestion transforms raw documents into searchable chunks indexed in a vector store. The pipeline is modular — swap any component based on your needs.

A FileLoader reads raw bytes from a source (local filesystem, cloud storage, or any custom source) and returns a File object.

A DocumentExtractor converts that File into a structured Document — parsing PDFs via OCR, converting HTML to Markdown, or reading plain text depending on the file type.

A TextSplitter divides the Document into smaller DocumentChunk objects. The strategy and chunk size directly affect retrieval quality.
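
To make the effect of chunk size concrete, here is a minimal sketch of fixed-size character splitting with overlap. It is a stand-alone illustration, not the toolkit's CharacterTextSplitter; the function name split_text and its overlap parameter are hypothetical.

```python
def split_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; not part of the toolkit.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Smaller chunks give more precise matches but less context per hit; overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.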

An optional ChunkEnricher adds metadata to each chunk before indexing — for example, an LLM-generated summary or entity tags.

An Embedder converts each chunk into a vector, and the store persists both the vector and the raw text for search.
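
The last stage can be pictured with a toy in-memory store that keeps each vector next to its raw text. This is a sketch under loud assumptions: toy_embed is a hashed bag-of-words stand-in for a real embedding model, and InMemoryStore is a hypothetical class, not the toolkit's search index.

```python
import hashlib
import math
from dataclasses import dataclass, field


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hash each token into one of `dim` buckets, then L2-normalise.

    Stand-in for a real embedding model; for illustration only.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class InMemoryStore:
    """Hypothetical store: persists (vector, raw text) pairs in a list."""
    rows: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, text: str) -> None:
        # Keep the vector for search and the raw text for display.
        self.rows.append((toy_embed(text), text))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        ranked = sorted(
            self.rows,
            key=lambda row: sum(a * b for a, b in zip(q, row[0])),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]
```

Storing the raw text alongside the vector is what lets a search return readable chunks rather than bare vector IDs.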

Pipeline

Pipeline is the main entry point. It handles file loading, extraction, chunking, embedding, and indexing. It supports batch processing with concurrency, progress tracking, and checkpointing.

from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.client import Mistral
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.search.toolkit.ingestion.extractors import PlainTextExtractor
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.ingestion.enrichers import SummaryEnricher
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app

mistral_client = Mistral(api_key="...")

collection_name = "my_collection"
config = VespaClientConfig(
    endpoint="http://localhost:8080",
)

vector_store = app.get_search_index(config, collection_name=collection_name)

pipeline = Pipeline(
    loader=FilesystemFileLoader(),
    extractor=PlainTextExtractor(),
    text_splitter=CharacterTextSplitter(chunk_size=512),
    chunk_enrichers=[SummaryEnricher(client=mistral_client)],
    embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
    vector_store=vector_store,
)

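# Top-level await works in a notebook or async REPL; in a script, call this from an async function.
num_chunks = await pipeline.run(
    documents=["doc1.txt", "doc2.txt"],
    collection_name=collection_name,
)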
Information

collection_name acts as a namespace for your indexed chunks. Use the same name when retrieving documents later.

Checkpointing

For large document collections, enable checkpointing to recover from failures:

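from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
from mistralai.search.toolkit.ingestion.progress import create_tqdm_progress_callback
from mistralai.search.toolkit.ingestion.enrichers import SummaryEnricher

# Assumes extractor, splitter, embedder, vector_store, mistral_client, and
# document_paths are already defined, as in the Pipeline example above.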
pipeline = Pipeline(
    loader=FilesystemFileLoader(),
    extractor=extractor,
    text_splitter=splitter,
    chunk_enrichers=[SummaryEnricher(client=mistral_client)],
    embedder=embedder,
    vector_store=vector_store,
    checkpoint_dir="./checkpoints",
)

num_chunks = await pipeline.run(
    documents=document_paths,
    collection_name="my_collection",
    use_checkpoint=True,
    progress_callback=create_tqdm_progress_callback(),
)

On restart, documents with existing checkpoint files are skipped.
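
The skip-on-restart behaviour can be sketched with one marker file per document. The marker naming scheme (a SHA-256 prefix of the path with a .done suffix) and the helper names are assumptions made for illustration; the toolkit's actual checkpoint format is not described here.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable


def checkpoint_path(checkpoint_dir: Path, document: str) -> Path:
    """Map a document path to its marker file (hypothetical naming scheme)."""
    digest = hashlib.sha256(document.encode("utf-8")).hexdigest()[:16]
    return checkpoint_dir / f"{digest}.done"


def run_with_checkpoints(
    documents: Iterable[str],
    checkpoint_dir: Path,
    process: Callable[[str], None],
) -> list[str]:
    """Process each document once; skip those that already have a marker."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    processed: list[str] = []
    for doc in documents:
        marker = checkpoint_path(checkpoint_dir, doc)
        if marker.exists():
            continue  # finished in an earlier run
        process(doc)
        # Write the marker only after processing succeeds, so a crash
        # mid-document causes that document to be retried on restart.
        marker.write_text(doc, encoding="utf-8")
        processed.append(doc)
    return processed
```

Writing the marker after the work, not before, is the design choice that makes a restart safe: an interrupted document is retried and never silently skipped.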

RoutedPipeline

RoutedPipeline routes each file to a protocol-specific pipeline based on its file extension and MIME type. Use it when ingesting mixed-format document collections: it selects the extractor and processing pipeline registered for each file type.

Protocols

RoutedPipeline ships with 9 built-in protocols. Each protocol maps a set of file types to the appropriate extractor pipeline:

Protocol        File types
ocr             PDF, DOCX, PPTX, ODT, EPUB
plain_text      TXT, MD, CSV, JS, PY, JSON, YAML, ...
html            HTML, HTM
xlsx            XLSX, XLS, ODS
numbers         NUMBERS
legacy_office   DOC, PPT, HWP, HWPX
email           EML, MSG
image           PNG, JPEG, GIF, WebP
audio           MP3, WAV, M4A, FLAC, OGG
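
The mapping above can be read as a lookup from file extension to protocol name, with MIME type as a fallback. The sketch below shows one plausible resolution order using only the standard library. The function resolve_protocol, the partial extension table, and the precedence (overrides, then extension, then MIME type) are assumptions for illustration, not RoutedPipeline's actual implementation; the overrides argument mirrors the protocol_overrides parameter documented on this page.

```python
import mimetypes
from pathlib import PurePath
from typing import Optional

# Partial, illustrative subset of the protocol table.
EXTENSION_TO_PROTOCOL = {
    ".pdf": "ocr", ".docx": "ocr",
    ".txt": "plain_text", ".md": "plain_text",
    ".html": "html", ".htm": "html",
    ".xlsx": "xlsx",
    ".doc": "legacy_office",
    ".eml": "email",
    ".png": "image",
    ".mp3": "audio",
}

# Checked in order, so the more specific "text/html" comes before "text/".
MIME_PREFIX_TO_PROTOCOL = {
    "text/html": "html",
    "image/": "image",
    "audio/": "audio",
    "text/": "plain_text",
}


def resolve_protocol(
    filename: str, overrides: Optional[dict[str, str]] = None
) -> Optional[str]:
    """Return a protocol name for `filename`, or None if nothing matches."""
    ext = PurePath(filename).suffix.lower()
    if overrides and ext in overrides:
        return overrides[ext]  # explicit override wins
    if ext in EXTENSION_TO_PROTOCOL:
        return EXTENSION_TO_PROTOCOL[ext]
    mime, _ = mimetypes.guess_type(filename)  # fall back to MIME type
    if mime:
        for prefix, protocol in MIME_PREFIX_TO_PROTOCOL.items():
            if mime.startswith(prefix):
                return protocol
    return None
```

In this sketch a file with no recognised extension or MIME type resolves to None, leaving the caller to decide whether to skip it or raise an error.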

Example

Basic usage with default protocols:

from mistralai.search.toolkit.ingestion.pipelines import RoutedPipeline

router = RoutedPipeline({
    "ocr": ocr_pipeline,
    "html": html_pipeline,
    "plain_text": text_pipeline,
    "xlsx": spreadsheet_pipeline,
})

document = await router.run_file(file=file)

Parameters

Parameter            Description
pipelines            Dictionary mapping protocol names to Pipeline instances
mime_registry        Optional custom MIME registry for file-type detection
protocol_overrides   Optional dictionary to override the default protocol for specific file extensions

Override Example

Force specific file extensions to use a different protocol:

router = RoutedPipeline(
    pipelines={
        "ocr": ocr_pipeline,
        "html": html_pipeline,
        "plain_text": text_pipeline,
        "xlsx": spreadsheet_pipeline,
    },
    protocol_overrides={
        ".doc": "ocr",  # Force legacy docs through OCR instead of legacy_office
    },
)

document = await router.run_file(file=file)

Components

Each ingestion component is documented in detail with configuration options, best practices, and examples:

  • File loaders — Load files from filesystem, cloud storage, or custom sources
  • Document extractors — Extract content from PDFs, HTML, spreadsheets, and more (7 formats supported)
  • Text splitters — Divide documents into chunks with optimal size and overlap
  • Chunk enrichers — Add custom metadata for filtering and ranking
  • Embedders — Convert text to vectors for semantic search
  • Search index — Store and search over indexed chunks