Chunk enrichers add custom metadata to chunks during ingestion. Use them to attach information from external sources, classifications, tags, or any computed metadata.

Available chunk enrichers

Enricher          Purpose
Summary Enricher  Generate document summaries using an LLM
Custom Enrichers  Add custom metadata from any source

Summary Enricher

SummaryEnricher generates a document summary using an LLM and optionally injects it into chunks and/or document metadata. This can improve retrieval by giving each chunk context about the full document.

By default, SummaryEnricher is non-breaking: if summary generation fails, it logs the failure and returns the original chunks unchanged.
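
The non-breaking behavior can be pictured as a fail-open wrapper. The following is a minimal sketch of that pattern, not SummaryEnricher's actual implementation: `summarize_or_passthrough`, `flaky_summarizer`, and the plain-string chunks are hypothetical stand-ins, and the `fail_on_error` argument plays the role of an opt-in strict mode.

```python
import logging
from typing import Callable

logger = logging.getLogger("enrichment_sketch")


def summarize_or_passthrough(
    chunks: list[str],
    summarize: Callable[[str], str],
    fail_on_error: bool = False,
) -> list[str]:
    """Prepend a summary chunk, or return the chunks unchanged if summarization fails."""
    try:
        summary = summarize("\n".join(chunks))
    except Exception:
        if fail_on_error:
            # Strict mode: surface the failure to the caller.
            raise
        # Fail-open mode: log the failure and keep the pipeline going.
        logger.exception("Summary generation failed; returning chunks unchanged")
        return chunks
    return [summary, *chunks]


def flaky_summarizer(text: str) -> str:
    # Hypothetical stand-in for an LLM call that is currently failing.
    raise RuntimeError("LLM unavailable")


original = ["first chunk", "second chunk"]
result = summarize_or_passthrough(original, flaky_summarizer)
```

With the default settings the failure is only logged and `result` holds the original chunks; passing `fail_on_error=True` re-raises the error instead.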

Requirements:

  • A Mistral API key

Usage:

import os

from mistralai.client import Mistral
from mistralai.search.toolkit.ingestion.enrichment import SummaryEnricher, SummaryConfig, SummarizeRequestConfig
from mistralai.search.toolkit.llm import MistralChat, LLMConfig

# Create LLM provider
mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY", "your-api-key"))
llm = MistralChat(client=mistral_client, config=LLMConfig(model="mistral-small-latest"))

# Create enricher with default settings
enricher = SummaryEnricher(llm_provider=llm)

# Or customize the summary behavior
enricher = SummaryEnricher(
    llm_provider=llm,
    summary_config=SummaryConfig(
        request_config=SummarizeRequestConfig(
            prompt="Summarize this document in 3 sentences.",
            max_tokens=256,
        ),
    ),
)

Parameters:

Parameter       Type                  Default     Description
llm_provider    ChatLLMProvider       (required)  LLM provider for summarization
summary_config  SummaryConfig | None  None        Configuration object (see below)

SummarizeRequestConfig:

Controls the LLM request for summary generation.

Parameter    Type          Default                                         Description
model        str           "mistral-small-latest"                          Model to use for summarization
prompt       str           "Summarize the document in less than 5 lines."  Summarization prompt
max_tokens   int           256                                             Maximum tokens in the summary
truncate_at  int | None    32768                                           Truncate document content before sending to LLM
temperature  float | None  0.6                                             LLM temperature

SummaryRequestOptions:

Controls how the summary is injected into the pipeline output.

Parameter                    Type  Default  Description
include_summary_chunk        bool  True     Add a dedicated summary chunk to the chunk list
propagate_summary_to_chunks  bool  False    Prepend the summary to every chunk's content
populate_document_metadata   bool  True     Store the summary in the document's metadata
fail_on_generation_error     bool  False    Raise on summary generation failure instead of logging and continuing
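
To make the three injection flags concrete, here is a rough sketch of what each one does to the pipeline output. It uses hypothetical `Chunk` and `Doc` dataclasses rather than the toolkit's own types, and it assumes that the summary chunk is placed first, that the summary is separated from chunk content by a blank line, and that the metadata key is `"summary"`; none of these details are confirmed by the library.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a document chunk."""
    content: str


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a document with a metadata dict."""
    metadata: dict = field(default_factory=dict)


def inject_summary(
    summary: str,
    chunks: list[Chunk],
    doc: Doc,
    include_summary_chunk: bool = True,
    propagate_summary_to_chunks: bool = False,
    populate_document_metadata: bool = True,
) -> tuple[list[Chunk], Doc]:
    out = list(chunks)
    if propagate_summary_to_chunks:
        # Prepend the summary to every chunk's content.
        out = [replace(c, content=f"{summary}\n\n{c.content}") for c in out]
    if include_summary_chunk:
        # Add a dedicated summary chunk (assumed position: first).
        out = [Chunk(content=summary), *out]
    if populate_document_metadata:
        # Store the summary on the document (assumed key: "summary").
        doc = replace(doc, metadata={**doc.metadata, "summary": summary})
    return out, doc


chunks = [Chunk("alpha"), Chunk("beta")]
default_chunks, default_doc = inject_summary("S", chunks, Doc())
propagated, _ = inject_summary(
    "S", chunks, Doc(), include_summary_chunk=False, propagate_summary_to_chunks=True
)
```

With the defaults, the chunk list gains one extra chunk and the document metadata carries the summary; with propagation enabled, every chunk's content grows by the length of the summary, which is worth weighing against chunk size limits.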

Creating custom enrichers

Implement the ChunkEnricher interface to add custom metadata:

from mistralai.search.toolkit.ingestion.enrichment import ChunkEnricher
from mistralai.search.toolkit.document import DocumentChunk, Document

class EntityTagger(ChunkEnricher):
    """Add entity tags to chunks."""

    async def enrich_chunks(
        self, chunks: list[DocumentChunk], document: Document, concurrency: int = 10
    ) -> tuple[list[DocumentChunk], Document]:
        enriched = []
        for chunk in chunks:
            entities = await self._extract_entities(chunk.content)
            updated_metadata = chunk.metadata.model_copy(update={"entities": entities})
            enriched.append(chunk.model_copy(update={"metadata": updated_metadata}))

        return enriched, document

    async def _extract_entities(self, text: str) -> list[str]:
        entities = []
        # ... entity extraction logic ...
        return entities

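# Run inside an async function, or in a notebook that supports top-level await
enricher = EntityTagger()
enriched_chunks, document = await enricher.enrich_chunks(chunks, document)

for chunk in enriched_chunks:
    # metadata is a model object (it is updated with model_copy above), so read the field with getattr
    print(f"Entities: {getattr(chunk.metadata, 'entities', [])}")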
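
The extraction logic in `_extract_entities` is left open above. As a purely illustrative placeholder, the sketch below treats runs of capitalized words as entities. The function name `extract_entities` and the regular expression are assumptions made for this example; a real tagger would more likely call an NER model or an external service.

```python
import re

# Naive heuristic: one or more capitalized words separated by single spaces.
_ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")


def extract_entities(text: str) -> list[str]:
    """Return capitalized word runs, deduplicated in order of first appearance."""
    seen: dict[str, None] = {}
    for match in _ENTITY_PATTERN.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


entities = extract_entities(
    "Ada Lovelace met Charles Babbage in London. Ada Lovelace wrote notes."
)
```

This heuristic also picks up ordinary capitalized words at the start of a sentence, so it is only suitable as a stand-in while wiring up the enricher.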

Batch enrichment patterns

When enriching many chunks, run external API calls concurrently and cap the number in flight with the concurrency argument; batch requests where the external API supports it:

import asyncio
from mistralai.search.toolkit.ingestion.enrichment import ChunkEnricher
from mistralai.search.toolkit.document import DocumentChunk, Document

class BatchEnricher(ChunkEnricher):
    """Enrich chunks concurrently, capped by a semaphore."""

    async def enrich_chunks(
        self, chunks: list[DocumentChunk], document: Document, concurrency: int = 10
    ) -> tuple[list[DocumentChunk], Document]:
        # Use concurrency limit for external API calls
        semaphore = asyncio.Semaphore(concurrency)

        async def enrich_one(chunk: DocumentChunk) -> DocumentChunk:
            async with semaphore:
                metadata = await self._fetch_metadata(chunk.content)
                return chunk.model_copy(
                    update={"metadata": chunk.metadata.model_copy(update=metadata)}
                )

        enriched = await asyncio.gather(
            *[enrich_one(c) for c in chunks]
        )
        return enriched, document

    async def _fetch_metadata(self, text: str) -> dict:
        # Call external API with batching/retries
        ...
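
The example above bounds concurrency but still issues one call per chunk. If the external API accepts several inputs per request, chunks can also be grouped into batches. The sketch below shows that variant under stated assumptions: `Chunk` is a hypothetical dataclass rather than the toolkit's `DocumentChunk`, and `fetch_metadata_batch` stands in for a single API request that returns one metadata dict per input text.

```python
import asyncio
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a document chunk."""
    content: str
    metadata: dict = field(default_factory=dict)


async def fetch_metadata_batch(texts: list[str]) -> list[dict]:
    # Hypothetical stand-in for one API request covering several texts.
    await asyncio.sleep(0)
    return [{"length": len(t)} for t in texts]


async def enrich_in_batches(
    chunks: list[Chunk], batch_size: int = 2, concurrency: int = 10
) -> list[Chunk]:
    semaphore = asyncio.Semaphore(concurrency)
    # Split the chunk list into consecutive groups of at most batch_size.
    batches = [chunks[i : i + batch_size] for i in range(0, len(chunks), batch_size)]

    async def enrich_batch(batch: list[Chunk]) -> list[Chunk]:
        async with semaphore:  # at most `concurrency` requests in flight
            results = await fetch_metadata_batch([c.content for c in batch])
        return [
            replace(c, metadata={**c.metadata, **m}) for c, m in zip(batch, results)
        ]

    enriched_batches = await asyncio.gather(*(enrich_batch(b) for b in batches))
    # gather preserves input order, so flattening keeps the original chunk order.
    return [chunk for batch in enriched_batches for chunk in batch]


chunks = [Chunk("a"), Chunk("bb"), Chunk("ccc")]
enriched = asyncio.run(enrich_in_batches(chunks))
```

Here three chunks result in two requests instead of three. The right batch size depends on the request limits of the API being called.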

Combining multiple enrichers

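from mistralai.search.toolkit.ingestion.pipelines import Pipeline

# loader, extractor, splitter, embedder, and vector_store are assumed to be
# configured elsewhere; llm, SummaryEnricher, and EntityTagger come from the
# examples above, and CustomMetadataEnricher stands for any other ChunkEnricher subclass.
pipeline = Pipeline(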
    loader=loader,
    extractor=extractor,
    text_splitter=splitter,
    embedder=embedder,
    vector_store=vector_store,
    chunk_enrichers=[
        SummaryEnricher(llm_provider=llm),
        EntityTagger(),
        CustomMetadataEnricher(),
    ],
)

Enrichers are applied sequentially in the order listed. Each enricher receives the output of the previous one.
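
Sequential application can be sketched as a simple loop that threads the chunks and document through each enricher in turn. This is an illustration of the ordering rule only, not the Pipeline's internal code: `AppendTag` and `apply_enrichers` are hypothetical, and chunks are modeled as plain dicts.

```python
import asyncio


class AppendTag:
    """Hypothetical enricher that appends one tag to each chunk's tag list."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    async def enrich_chunks(
        self, chunks: list[dict], document: dict
    ) -> tuple[list[dict], dict]:
        tagged = [{**c, "tags": [*c.get("tags", []), self.tag]} for c in chunks]
        return tagged, document


async def apply_enrichers(
    enrichers: list[AppendTag], chunks: list[dict], document: dict
) -> tuple[list[dict], dict]:
    # Each enricher receives the output of the previous one.
    for enricher in enrichers:
        chunks, document = await enricher.enrich_chunks(chunks, document)
    return chunks, document


chunks, document = asyncio.run(
    apply_enrichers(
        [AppendTag("first"), AppendTag("second")],
        [{"content": "hello"}],
        {"title": "doc"},
    )
)
```

Because order matters, an enricher that depends on metadata produced by another one needs to come later in the list.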