Ingestion

Ingestion transforms raw documents into searchable chunks indexed in a vector store. The pipeline is modular — swap any component based on your needs.

A FileLoader reads raw bytes from a source (local filesystem, cloud storage, or any custom source) and returns a File object.

A DocumentExtractor converts that File into a structured Document — parsing PDFs via OCR, converting HTML to Markdown, or reading plain text depending on the file type.

A TextSplitter divides the Document into smaller DocumentChunk objects. The strategy and chunk size directly affect retrieval quality.

An optional ChunkEnricher adds metadata to each chunk before indexing — for example, an LLM-generated summary or entity tags.

An Embedder converts each chunk into a vector, and the store persists both the vector and the raw text for search.
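
The flow described above can be pictured as five small functions chained together. This is a minimal, standard-library sketch of the idea only: `Chunk`, `load`, `extract`, `split`, `enrich`, `embed`, and `ingest` are hypothetical stand-ins rather than the toolkit's classes, and the "embedding" is a toy two-number vector.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk."""
    text: str
    metadata: dict = field(default_factory=dict)
    vector: Optional[list] = None


def load(sources: dict, path: str) -> bytes:
    """Loader stage: return the raw bytes stored under a source path."""
    return sources[path]


def extract(raw: bytes) -> str:
    """Extractor stage: decode plain text (PDF or HTML would need real parsing)."""
    return raw.decode("utf-8")


def split(text: str, chunk_size: int) -> list:
    """Splitter stage: fixed-size character windows without overlap."""
    return [Chunk(text[i:i + chunk_size]) for i in range(0, len(text), chunk_size)]


def enrich(chunk: Chunk) -> Chunk:
    """Enricher stage: attach metadata, here just the character count."""
    chunk.metadata["num_chars"] = len(chunk.text)
    return chunk


def embed(chunk: Chunk) -> Chunk:
    """Embedder stage: toy vector standing in for a real embedding model."""
    vowels = sum(ch in "aeiou" for ch in chunk.text.lower())
    chunk.vector = [float(len(chunk.text)), float(vowels)]
    return chunk


def ingest(sources: dict, path: str, store: list, chunk_size: int = 16) -> int:
    """Run all five stages for one document and return the number of chunks stored."""
    chunks = split(extract(load(sources, path)), chunk_size)
    for chunk in chunks:
        store.append(embed(enrich(chunk)))
    return len(chunks)


sources = {"doc1.txt": b"Ingestion turns raw documents into searchable chunks."}
store = []  # stands in for the vector store: keeps both vector and raw text
count = ingest(sources, "doc1.txt", store)
```

Because every stage takes the previous stage's output, any one of them can be replaced without touching the others, which is the sense in which the pipeline is modular.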

Pipeline

Pipeline is the main entry point. It handles file loading, extraction, chunking, embedding, and indexing. It supports batch processing with concurrency, progress tracking, and checkpointing.

from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.client import Mistral
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.search.toolkit.ingestion.extractors import PlainTextExtractor
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.ingestion.enrichers import SummaryEnricher
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app

mistral_client = Mistral(api_key="...")

collection_name = "my_collection"
config = VespaClientConfig(
    endpoint="http://localhost:8080",
)

vector_store = app.get_search_index(config, collection_name=collection_name)

pipeline = Pipeline(
    loader=FilesystemFileLoader(),
    extractor=PlainTextExtractor(),
    text_splitter=CharacterTextSplitter(chunk_size=512),
    chunk_enrichers=[SummaryEnricher(client=mistral_client)],
    embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
    vector_store=vector_store,
)

# `await` must run inside an async function or an async-aware REPL/notebook.
num_chunks = await pipeline.run(
    documents=["doc1.txt", "doc2.txt"],
    collection_name=collection_name,
)
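
To illustrate what batch processing with bounded concurrency and progress tracking can look like, here is a rough `asyncio` sketch. It is not the toolkit's implementation: `run_batch`, `ingest_one`, `max_concurrency`, and `on_progress` are hypothetical names, and `fake_ingest` only simulates per-document work.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_batch(
    documents: list,
    ingest_one: Callable[[str], Awaitable[int]],
    max_concurrency: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> int:
    """Ingest documents concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    done = 0

    async def worker(path: str) -> int:
        nonlocal done
        async with semaphore:  # wait here while the limit is reached
            num_chunks = await ingest_one(path)
        done += 1
        if on_progress is not None:
            on_progress(done, len(documents))
        return num_chunks

    results = await asyncio.gather(*(worker(p) for p in documents))
    return sum(results)


async def fake_ingest(path: str) -> int:
    """Simulated ingestion: pretend each document yields len(path) chunks."""
    await asyncio.sleep(0.01)
    return len(path)


progress = []
total = asyncio.run(
    run_batch(
        ["doc1.txt", "doc2.txt", "doc3.txt"],
        fake_ingest,
        max_concurrency=2,
        on_progress=lambda done, n: progress.append((done, n)),
    )
)
```

The semaphore caps how many documents are in flight at once, which matters when each document triggers rate-limited embedding or enrichment calls.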

Information

collection_name acts as a namespace for your indexed chunks. Use the same name when retrieving documents later.

Checkpointing

For large document collections, enable checkpointing to recover from failures:

from mistralai.search.toolkit.ingestion.progress import create_tqdm_progress_callback
from mistralai.search.toolkit.ingestion.enrichers import SummaryEnricher

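# Pipeline, FilesystemFileLoader, mistral_client, and vector_store come from the previous example.
# extractor, splitter, embedder, and document_paths are placeholders for your own objects.
pipeline = Pipeline(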
    loader=FilesystemFileLoader(),
    extractor=extractor,
    text_splitter=splitter,
    chunk_enrichers=[SummaryEnricher(client=mistral_client)],
    embedder=embedder,
    vector_store=vector_store,
    checkpoint_dir="./checkpoints",
)

num_chunks = await pipeline.run(
    documents=document_paths,
    collection_name="my_collection",
    use_checkpoint=True,
    progress_callback=create_tqdm_progress_callback(),
)

On restart, documents with existing checkpoint files are skipped.
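
The skip-on-restart behaviour can be pictured as one marker file per completed document. The sketch below is an assumption about the general technique, not the toolkit's checkpoint format: `checkpoint_path`, `run_with_checkpoints`, and the `.done` file naming are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def checkpoint_path(checkpoint_dir: Path, document: str) -> Path:
    """Derive a stable marker filename from the document path."""
    digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
    return checkpoint_dir / f"{digest}.done"


def run_with_checkpoints(documents, checkpoint_dir: Path, process) -> list:
    """Process documents that have no marker yet; return the ones processed."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for document in documents:
        marker = checkpoint_path(checkpoint_dir, document)
        if marker.exists():
            continue  # finished in an earlier run
        process(document)  # if this raises, no marker is written
        marker.write_text("ok")
        processed.append(document)
    return processed


def flaky(document: str) -> None:
    """Simulated ingestion that fails on the second document."""
    if document == "doc2.txt":
        raise RuntimeError("simulated failure")


docs = ["doc1.txt", "doc2.txt", "doc3.txt"]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoints"
    try:
        run_with_checkpoints(docs, ckpt, flaky)  # stops at doc2.txt
    except RuntimeError:
        pass
    second_run = run_with_checkpoints(docs, ckpt, lambda d: None)
```

In this sketch the marker is written only after a document finishes, so a crash mid-document leaves it eligible for reprocessing on the next run.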

RoutedPipeline

RoutedPipeline routes each file to a protocol-specific pipeline based on its file extension and MIME type. Use it when ingesting mixed-format document collections, so that every file is handled by the extractor and pipeline registered for its protocol.

Protocols

RoutedPipeline ships with 9 built-in protocols. Each protocol maps a set of file types to the appropriate extractor pipeline:

Protocol	File types
`ocr`	PDF, DOCX, PPTX, ODT, EPUB
`plain_text`	TXT, MD, CSV, JS, PY, JSON, YAML, ...
`html`	HTML, HTM
`xlsx`	XLSX, XLS, ODS
`numbers`	NUMBERS
`legacy_office`	DOC, PPT, HWP, HWPX
`email`	EML, MSG
`image`	PNG, JPEG, GIF, WebP
`audio`	MP3, WAV, M4A, FLAC, OGG
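
As a rough illustration of extension-based routing, the table above can be expressed as a lookup from file suffix to protocol name. This sketch is hypothetical: it ignores MIME detection, lists only the extensions spelled out in the table (the `plain_text` row is truncated there), and `protocol_for` is not a toolkit function.

```python
from pathlib import PurePath
from typing import Optional

# Protocol -> extensions, copied from the table above (explicitly listed types only).
PROTOCOL_EXTENSIONS = {
    "ocr": {".pdf", ".docx", ".pptx", ".odt", ".epub"},
    "plain_text": {".txt", ".md", ".csv", ".js", ".py", ".json", ".yaml"},
    "html": {".html", ".htm"},
    "xlsx": {".xlsx", ".xls", ".ods"},
    "numbers": {".numbers"},
    "legacy_office": {".doc", ".ppt", ".hwp", ".hwpx"},
    "email": {".eml", ".msg"},
    "image": {".png", ".jpeg", ".gif", ".webp"},
    "audio": {".mp3", ".wav", ".m4a", ".flac", ".ogg"},
}

# Invert the mapping so that a suffix can be looked up directly.
EXTENSION_TO_PROTOCOL = {
    ext: protocol
    for protocol, extensions in PROTOCOL_EXTENSIONS.items()
    for ext in extensions
}


def protocol_for(filename: str) -> Optional[str]:
    """Return the protocol for a filename, or None if the suffix is unknown."""
    return EXTENSION_TO_PROTOCOL.get(PurePath(filename).suffix.lower())
```

For example, `protocol_for("report.PDF")` resolves to `"ocr"` because the suffix is lower-cased before the lookup, while an unlisted suffix such as `.zip` yields `None`.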

Example

Basic usage with default protocols:

from mistralai.search.toolkit.ingestion.pipelines import RoutedPipeline

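# The four pipelines below are Pipeline instances configured as in the Pipeline section.
router = RoutedPipeline({
    "ocr": ocr_pipeline,
    "html": html_pipeline,
    "plain_text": text_pipeline,
    "xlsx": spreadsheet_pipeline,
})

# `file` is a File object, such as the one a FileLoader returns.
document = await router.run_file(file=file)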

Parameters

Parameter	Description
`pipelines`	Dictionary mapping protocol names to `Pipeline` instances
`mime_registry`	Optional custom MIME registry for file-type detection
`protocol_overrides`	Optional dictionary to override the default protocol for specific file extensions

Override Example

Force specific file extensions to use a different protocol:

router = RoutedPipeline(
    pipelines={
        "ocr": ocr_pipeline,
        "html": html_pipeline,
        "plain_text": text_pipeline,
        "xlsx": spreadsheet_pipeline,
    },
    protocol_overrides={
        ".doc": "ocr",  # Force legacy docs through OCR instead of legacy_office
    },
)

document = await router.run_file(file=file)
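
One way to think about overrides is as a dictionary merged on top of the default extension mapping before dispatch. The sketch below assumes that precedence and is not taken from the toolkit's source: `DEFAULTS`, `resolve_protocol`, `route`, and the two toy pipelines are all hypothetical.

```python
import asyncio
from pathlib import PurePath

# Hypothetical subset of the default extension -> protocol mapping.
DEFAULTS = {".doc": "legacy_office", ".pdf": "ocr", ".html": "html"}


def resolve_protocol(filename: str, overrides=None) -> str:
    """Pick a protocol for a file; entries in `overrides` win over defaults."""
    suffix = PurePath(filename).suffix.lower()
    merged = {**DEFAULTS, **(overrides or {})}
    if suffix not in merged:
        raise ValueError(f"no protocol known for suffix {suffix!r}")
    return merged[suffix]


async def route(filename: str, pipelines: dict, overrides=None) -> str:
    """Dispatch a file to the pipeline registered for its resolved protocol."""
    protocol = resolve_protocol(filename, overrides)
    if protocol not in pipelines:
        raise KeyError(f"no pipeline registered for protocol {protocol!r}")
    return await pipelines[protocol](filename)


async def ocr_pipeline(filename: str) -> str:
    """Toy pipeline that only reports which route was taken."""
    return f"ocr:{filename}"


async def legacy_pipeline(filename: str) -> str:
    """Toy pipeline that only reports which route was taken."""
    return f"legacy_office:{filename}"


pipelines = {"ocr": ocr_pipeline, "legacy_office": legacy_pipeline}
default_result = asyncio.run(route("memo.doc", pipelines))
overridden_result = asyncio.run(route("memo.doc", pipelines, {".doc": "ocr"}))
```

Here `default_result` is `"legacy_office:memo.doc"` and `overridden_result` is `"ocr:memo.doc"`. Note that an override only helps if a pipeline is registered under the target protocol name.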

Components

Each ingestion component is documented in detail with configuration options, best practices, and examples:

File loaders — Load files from filesystem, cloud storage, or custom sources
Document extractors — Extract content from PDFs, HTML, spreadsheets, and more (7 formats supported)
Text splitters — Divide documents into chunks with optimal size and overlap
Chunk enrichers — Add custom metadata for filtering and ranking
Embedders — Convert text to vectors for semantic search
Search index — Store and search over indexed chunks