Text splitters divide documents into retrievable DocumentChunk objects. The choice of splitter significantly impacts retrieval quality.

Available text splitters

| Splitter | Best for |
| --- | --- |
| Character Text Splitter | Simple text, quick prototyping |
| Token Text Splitter | Token-aware chunking, LLM context management |
| Markdown Text Splitter | Markdown documents, structured content with headers |
| Separator Text Splitter | Custom splitting logic, hierarchical text |
| Custom Splitters | Any splitting strategy |

All splitters automatically preserve important metadata in each chunk:

  • page_number: Source page number
  • filename: Original filename
  • filepath: Original file path
  • start_offset: Character position in the document where the chunk starts
  • end_offset: Character position in the document where the chunk ends
  • images: List of image references in chunk (if any)

Chunk size guidance

Chunk size significantly impacts retrieval quality and LLM context usage. Choose based on your use case:

| LLM context or content type | Chunk size | Overlap | Use case |
| --- | --- | --- | --- |
| 4k tokens | 300-500 chars | 50-100 chars | Memory-constrained, fast retrieval |
| 8k tokens | 500-1000 chars | 100-200 chars | Balanced retrieval quality |
| 32k+ tokens | 1000-2000 chars | 200-500 chars | Rich context, complex retrieval |
| Code/technical | 500-1000 chars | 100-200 chars | Preserve logical units |
| Legal/financial | 1000-2000 chars | 200-500 chars | Full context for interpretation |

Rule of thumb: ~400-600 characters ≈ 100-150 tokens. Adjust based on domain complexity and retrieval precision needs.
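
The rule of thumb above works out to roughly 4 characters per token. A minimal sketch of that arithmetic follows; `estimate_tokens` and `max_chunk_chars` are hypothetical helpers written for illustration, not part of the toolkit, and the 4.0 ratio is a heuristic that varies with language and content.

```python
import math


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (heuristic, not a tokenizer)."""
    return math.ceil(num_chars / chars_per_token)


def max_chunk_chars(token_budget: int, chars_per_token: float = 4.0) -> int:
    """Largest character count expected to fit within a token budget."""
    return int(token_budget * chars_per_token)


# The two ends of the rule-of-thumb range.
print(estimate_tokens(400), estimate_tokens(600))

# Character budget for a chunk that should stay near 150 tokens.
print(max_chunk_chars(150))
```

For exact budgeting, count tokens with a real tokenizer instead of relying on this estimate.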

Chunk overlap trade-offs:

  • Prevents chunk boundaries from separating related concepts
  • Improves context when chunks are used individually
  • Cost: increases index size and query latency as overlap grows
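
To make the cost of overlap concrete, here is a small sketch that splits the same text with different overlaps and reports how much data ends up stored. `sliding_chunks` is a hypothetical fixed-window splitter written for this illustration; it does not reproduce how the toolkit's splitters choose boundaries.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character windows where consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "x" * 10_000
for overlap in (0, 100, 250):
    chunks = sliding_chunks(text, chunk_size=500, overlap=overlap)
    stored = sum(len(c) for c in chunks)
    print(f"overlap={overlap}: {len(chunks)} chunks, {stored} characters stored")
```

With a 500-character chunk size, a 250-character overlap roughly doubles the number of chunks to embed and index compared with no overlap.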

Character Text Splitter

CharacterTextSplitter splits text by character count. Best for simple text and quick prototyping.

Example:

from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,     # Characters per chunk
)
chunks = splitter.split_document(document)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | int | 1000 | Maximum number of characters per chunk |

Token Text Splitter

TokenTextSplitter splits text by token count, using a tokenizer. Use it for LLM context window management and precise token control.

Example:

from mistralai.search.toolkit.ingestion.text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=1000,         # Tokens per chunk
    chunk_overlap=200,       # Token overlap
)

chunks = splitter.split_document(document)

Configuration options:

| Option | Type | Default | Purpose |
| --- | --- | --- | --- |
| chunk_size | int | 1000 | Tokens per chunk (uses Mistral tokenizer) |
| chunk_overlap | int | 0 | Tokens of overlap between chunks |
| tokenizer_model | str | "mistral" | Tokenizer to use for counting |

When to use:

  • Precise token budgeting for LLM context windows
  • Consistent chunks across different text densities
  • When working with specific token limits
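
The idea of counting tokens rather than characters can be sketched without the library. The example below uses whitespace-separated words as a stand-in tokenizer (an assumption for illustration; a real tokenizer produces subword tokens), and `split_by_tokens` is a hypothetical helper, not the toolkit's implementation.

```python
def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Group tokens into windows of `chunk_size`, sharing `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()  # stand-in tokenizer: whitespace-delimited words
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the remaining tokens
    return chunks


dense = "a b c d e f g h i j"                         # 10 short tokens
sparse = " ".join(["internationalization"] * 10)      # 10 long tokens

print(split_by_tokens(dense, chunk_size=4, chunk_overlap=1))

# Same token count gives the same number of chunks, whatever the character length.
print(len(split_by_tokens(dense, 4, 1)), len(split_by_tokens(sparse, 4, 1)))
```

A character-based splitter would cut the two texts into very different numbers of chunks, which is why token-based splitting gives more predictable context usage.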

Markdown Text Splitter

MarkdownTextSplitter splits markdown documents at header boundaries. It inherits from SeparatorTextSplitter and adds header-aware splitting with configurable header levels.

Requirements:

Install the text-splitter-langchain extra:

uv add "mistralai-search-toolkit[text-splitter-langchain]"

Example:

from mistralai.search.toolkit.ingestion.text_splitters import MarkdownTextSplitter, MarkdownTextSplitterConfig

config = MarkdownTextSplitterConfig(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ],
    strip_headers=False,    # Keep headers in chunks
    chunk_size=1000,        # Inherited from SeparatorTextSplitterConfig
    chunk_overlap=200,      # Inherited from SeparatorTextSplitterConfig
)

splitter = MarkdownTextSplitter(config=config)
chunks = splitter.split_document(document)

Parameters (MarkdownTextSplitterConfig):

MarkdownTextSplitterConfig extends SeparatorTextSplitterConfig, adding:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| headers_to_split_on | list[tuple[str, str]] | [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")] | Markdown headers to split on; each tuple is (header_prefix, label) |
| strip_headers | bool | False | Remove header lines from chunk content |

All parameters from SeparatorTextSplitterConfig are also available (chunk_size, chunk_max_size, chunk_overlap, chunk_separators, keep_separator, strip_whitespace).
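
As an illustration of what splitting at header boundaries means, here is a pure-Python sketch. `split_by_headers` and its return shape (a list of dicts) are assumptions made for this example and do not reflect the toolkit's internals or the structure of DocumentChunk. The sketch also ignores size limits and does not special-case `#` lines inside fenced code blocks.

```python
import re

# A header line: 1 to 6 "#" characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_headers(
    text: str,
    headers_to_split_on: list[tuple[str, str]],
    strip_headers: bool = False,
) -> list[dict]:
    """Start a new section at each configured header and record the headers in scope."""
    labels = dict(headers_to_split_on)   # e.g. {"#": "h1", "##": "h2"}
    sections: list[dict] = []
    active: dict[str, str] = {}          # header prefix -> title currently in scope
    lines: list[str] = []

    def flush() -> None:
        content = "\n".join(lines).strip()
        if content:
            headers = {labels[p]: title for p, title in active.items()}
            sections.append({"content": content, "headers": headers})
        lines.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match and match.group(1) in labels:
            flush()  # close the previous section under its own headers
            prefix, title = match.groups()
            # A new header ends every header at the same or a deeper level.
            for stale in [p for p in active if len(p) >= len(prefix)]:
                del active[stale]
            active[prefix] = title
            if not strip_headers:
                lines.append(line)
        else:
            lines.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n\n## Install\nRun the installer.\n\n## Usage\nCall the API.\n"
for section in split_by_headers(doc, [("#", "h1"), ("##", "h2")]):
    print(section["headers"], repr(section["content"]))
```

Each section carries the labels of the headers above it, which is what makes header-based chunks easier to filter or cite at retrieval time.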

Separator Text Splitter

SeparatorTextSplitter splits text recursively using an ordered list of separators. Use it for custom document structures that need precise control over split points, or for hierarchically organized text.

Example:

from mistralai.search.toolkit.ingestion.text_splitters import SeparatorTextSplitter, SeparatorTextSplitterConfig

config = SeparatorTextSplitterConfig(
    chunk_size=1000,          # Target size (characters)
    chunk_max_size=1500,      # Maximum size (characters)
    chunk_overlap=200,        # Overlap (characters)
    chunk_separators=[        # Tried in order
        "\n\n",    # Paragraphs first
        "\n",      # Then lines
        ". ",      # Then sentences
        " ",       # Then words
        "",        # Finally characters
    ],
)

splitter = SeparatorTextSplitter(config=config)
chunks = splitter.split_document(document)

Features:

  • Tries each separator in the order listed in chunk_separators
  • Respects chunk_max_size even when splitting on separators fails
  • Merges small adjacent pieces when possible
  • Returns start and end offsets for each chunk
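
The try-in-order and merge behaviors can be sketched in a few lines. `recursive_split` below is a hypothetical, simplified illustration of the general recursive-separator technique: it uses a single size limit, drops separators at chunk boundaries, and omits overlap and offsets, so it should not be read as the toolkit's algorithm.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text so that every returned piece has at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing left to split on: fall back to fixed-size character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate               # merge small pieces together
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(part, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Alpha beta gamma.\n\nDelta epsilon.\n\nZeta eta theta iota kappa lambda mu."
for chunk in recursive_split(text, ["\n\n", "\n", ". ", " ", ""], chunk_size=20):
    print(repr(chunk))
```

The first two paragraphs fit the limit and stay whole; only the long third paragraph falls through to word-level splitting.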

Creating custom splitters

Implement the TextSplitter protocol:

from mistralai.search.toolkit.ingestion.text_splitters import TextSplitter
from mistralai.search.toolkit.document import DocumentChunk

class ParagraphSplitter(TextSplitter):
    """Split text into chunks by paragraphs."""

    def __init__(self, max_paragraphs_per_chunk: int = 3):
        self.max_paragraphs_per_chunk = max_paragraphs_per_chunk

    def split_text(self, text: str, **kwargs) -> list[DocumentChunk]:
        paragraphs = text.split("\n\n")
        chunks = []
        offset = 0

        for i in range(0, len(paragraphs), self.max_paragraphs_per_chunk):
            chunk_paragraphs = paragraphs[i:i + self.max_paragraphs_per_chunk]
            content = "\n\n".join(chunk_paragraphs)

            chunks.append(DocumentChunk(
                content=content,
                start_offset=offset,
                end_offset=offset + len(content),
            ))
            offset += len(content) + 2  # Skip the "\n\n" between this chunk and the next

        return chunks

splitter = ParagraphSplitter(max_paragraphs_per_chunk=3)
chunks = splitter.split_text("First paragraph.\n\nSecond paragraph.\n\nThird paragraph.")
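
When writing a custom splitter, it helps to check that each chunk's offsets point back at its own content. The sketch below reproduces the paragraph logic with a stand-in `Chunk` dataclass (an assumption, so the example runs without the toolkit installed) and treats end_offset as exclusive, which matches how ParagraphSplitter computes it above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for DocumentChunk with only the fields used in this check."""
    content: str
    start_offset: int
    end_offset: int


def split_paragraphs(text: str, max_paragraphs_per_chunk: int = 3) -> list[Chunk]:
    """Same grouping and offset logic as ParagraphSplitter.split_text."""
    paragraphs = text.split("\n\n")
    chunks: list[Chunk] = []
    offset = 0
    for i in range(0, len(paragraphs), max_paragraphs_per_chunk):
        content = "\n\n".join(paragraphs[i:i + max_paragraphs_per_chunk])
        chunks.append(Chunk(content, offset, offset + len(content)))
        offset += len(content) + 2  # skip the "\n\n" separating consecutive chunks
    return chunks


def offsets_are_consistent(text: str, chunks: list[Chunk]) -> bool:
    """True if slicing the source text by each chunk's offsets yields its content."""
    return all(text[c.start_offset:c.end_offset] == c.content for c in chunks)


text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.\n\nFourth paragraph."
chunks = split_paragraphs(text, max_paragraphs_per_chunk=2)
print(len(chunks), offsets_are_consistent(text, chunks))
```

A check like this is cheap to run in a unit test and catches off-by-separator errors before they reach highlighting or citation code that relies on the offsets.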