Text splitters divide documents into retrievable DocumentChunk objects. The choice of splitter significantly impacts retrieval quality.
Available text splitters
| Splitter | Best for |
|---|---|
| Character Text Splitter | Simple text, quick prototyping |
| Token Text Splitter | Token-aware chunking, LLM context management |
| Markdown Text Splitter | Markdown documents, structured content with headers |
| Separator Text Splitter | Custom splitting logic, hierarchical text |
| Custom Splitters | Any splitting strategy |
All splitters automatically preserve important metadata in each chunk:
- page_number: Source page number
- filename: Original filename
- filepath: Original file path
- start_offset: Character position in the document where the chunk starts
- end_offset: Character position in the document where the chunk ends
- images: List of image references in the chunk (if any)
Chunk size guidance
Chunk size significantly impacts retrieval quality and LLM context usage. Choose based on your use case:
| LLM Context | Chunk Size | Overlap | Use case |
|---|---|---|---|
| 4k tokens | 300-500 chars | 50-100 chars | Memory-constrained, fast retrieval |
| 8k tokens | 500-1000 chars | 100-200 chars | Balanced retrieval quality |
| 32k+ tokens | 1000-2000 chars | 200-500 chars | Rich context, complex retrieval |
| Code/technical | 500-1000 chars | 100-200 chars | Preserve logical units |
| Legal/financial | 1000-2000 chars | 200-500 chars | Full context for interpretation |
Rule of thumb: ~400-600 characters ≈ 100-150 tokens. Adjust based on domain complexity and retrieval precision needs.
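The rule of thumb above works out to roughly 4 characters per token. The following is a minimal sketch of that arithmetic. The helper names and the 4.0 ratio are illustrative assumptions, not part of the toolkit, and real ratios vary with language and tokenizer:

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (assumes ~4 chars/token)."""
    return round(num_chars / chars_per_token)


def chunks_that_fit(context_tokens: int, chunk_chars: int, reserved_tokens: int = 0) -> int:
    """Number of chunks of `chunk_chars` that fit after reserving prompt/answer tokens."""
    per_chunk = max(1, estimate_tokens(chunk_chars))
    return max(0, (context_tokens - reserved_tokens) // per_chunk)


print(estimate_tokens(400))  # 100
print(estimate_tokens(600))  # 150
# 8k context, 1000-char chunks, 2000 tokens kept for the prompt and answer
print(chunks_that_fit(8000, 1000, reserved_tokens=2000))  # 24
```

For exact budgeting, count tokens with the actual tokenizer instead of relying on this estimate.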
Chunk overlap trade-offs:
- Keeps related concepts together when a chunk boundary would otherwise split them
- Improves context when chunks are used individually
- Increases index size and query latency roughly in proportion to the amount of overlap
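The index-size cost of overlap can be estimated before ingesting anything. This sketch assumes a plain sliding window, where each chunk starts `chunk_size - overlap` characters after the previous one. The toolkit's splitters may merge or cut chunks differently, so treat the numbers as an approximation; `chunk_count` is a hypothetical helper:

```python
import math


def chunk_count(doc_len: int, chunk_size: int, overlap: int = 0) -> int:
    """Approximate number of sliding-window chunks for a document of `doc_len` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_len <= 0:
        return 0
    if doc_len <= chunk_size:
        return 1
    stride = chunk_size - overlap  # Characters of new text per chunk
    return math.ceil((doc_len - overlap) / stride)


baseline = chunk_count(10_000, 1000, overlap=0)
with_overlap = chunk_count(10_000, 1000, overlap=200)
print(baseline, with_overlap)  # 10 13
print(f"index growth: {with_overlap / baseline:.2f}x")  # index growth: 1.30x
```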
Character Text Splitter
CharacterTextSplitter splits text by character count. Best for simple text and quick prototyping.
Example:
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
splitter = CharacterTextSplitter(
    chunk_size=500,  # Characters per chunk
)
chunks = splitter.split_document(document)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 1000 | Maximum number of characters per chunk |
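To make the behavior concrete, here is a minimal plain-Python sketch of fixed-size character splitting. It is not the CharacterTextSplitter implementation; `split_by_characters` is a hypothetical function that only illustrates the idea of cutting every `chunk_size` characters and recording offsets:

```python
def split_by_characters(text: str, chunk_size: int) -> list[tuple[int, int, str]]:
    """Cut `text` every `chunk_size` characters; return (start, end, content) tuples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, len(text)), text[start:start + chunk_size])
        for start in range(0, len(text), chunk_size)
    ]


for start, end, content in split_by_characters("abcdefghij", 4):
    print(start, end, content)
# 0 4 abcd
# 4 8 efgh
# 8 10 ij
```

Cuts land wherever the count falls, including mid-word, which is why this strategy suits prototyping more than production retrieval.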
Token Text Splitter
TokenTextSplitter splits text by token count rather than character count. It uses a tokenizer, which makes it suited to managing LLM context windows where precise token control matters.
Example:
from mistralai.search.toolkit.ingestion.text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(
    chunk_size=1000,  # Tokens per chunk
    chunk_overlap=200,  # Token overlap
)
chunks = splitter.split_document(document)
Configuration options:
| Option | Type | Default | Purpose |
|---|---|---|---|
| chunk_size | int | 1000 | Tokens per chunk (uses Mistral tokenizer) |
| chunk_overlap | int | 0 | Tokens of overlap between chunks |
| tokenizer_model | str | "mistral" | Tokenizer to use for counting |
When to use:
- Precise token budgeting for LLM context windows
- Consistent chunks across different text densities
- When working with specific token limits
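The windowing idea behind token-based chunking with overlap can be sketched as follows. This is an illustration only: it uses whitespace splitting as a stand-in tokenizer, whereas TokenTextSplitter counts tokens with a real tokenizer, and `split_by_tokens` is a hypothetical name:

```python
def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Slide a window of `chunk_size` tokens, stepping by `chunk_size - chunk_overlap`."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    tokens = text.split()  # Stand-in tokenizer: one token per whitespace-separated word
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # This window already reached the end of the text
    return chunks


print(split_by_tokens("a b c d e f g", chunk_size=3, chunk_overlap=1))
# ['a b c', 'c d e', 'e f g']
```

Each chunk repeats the last token of the previous one, so every chunk stays within the token budget regardless of how long individual words are.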
MarkdownTextSplitter
MarkdownTextSplitter splits markdown documents at header boundaries. It inherits from SeparatorTextSplitter and adds header-aware splitting with configurable header levels.
Requirements:
Install the text-splitter-langchain extra:
uv add "mistralai-search-toolkit[text-splitter-langchain]"
Example:
from mistralai.search.toolkit.ingestion.text_splitters import MarkdownTextSplitter, MarkdownTextSplitterConfig
config = MarkdownTextSplitterConfig(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ],
    strip_headers=False,  # Keep headers in chunks
    chunk_size=1000,  # Inherited from SeparatorTextSplitterConfig
    chunk_overlap=200,  # Inherited from SeparatorTextSplitterConfig
)
splitter = MarkdownTextSplitter(config=config)
chunks = splitter.split_document(document)
Parameters (MarkdownTextSplitterConfig):
MarkdownTextSplitterConfig extends SeparatorTextSplitterConfig, adding:
| Parameter | Type | Default | Description |
|---|---|---|---|
| headers_to_split_on | list[tuple[str, str]] | [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")] | Markdown headers to split on; each tuple is (header_prefix, label) |
| strip_headers | bool | False | Remove header lines from chunk content |
All parameters from SeparatorTextSplitterConfig are also available (chunk_size, chunk_max_size, chunk_overlap, chunk_separators, keep_separator, strip_whitespace).
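Header-boundary splitting can be illustrated with a short standalone sketch. It is not the MarkdownTextSplitter implementation: `split_on_headers` is a hypothetical function, it handles only `#` to `###` headers, and it does not skip `#` lines inside fenced code blocks:

```python
import re

# Matches level 1-3 ATX headers such as "## Title"
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_on_headers(markdown: str, strip_headers: bool = False) -> list[dict]:
    """Start a new section at every header line; optionally drop the header itself."""
    sections = []
    current = {"header": None, "lines": []}
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the previous section if it holds anything
            if current["header"] is not None or current["lines"]:
                sections.append(current)
            current = {
                "header": match.group(2).strip(),
                "lines": [] if strip_headers else [line],
            }
        else:
            current["lines"].append(line)
    if current["header"] is not None or current["lines"]:
        sections.append(current)
    return [
        {"header": s["header"], "content": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


doc = "# Title\nIntro\n## Part\nBody"
for section in split_on_headers(doc):
    print(section)
# {'header': 'Title', 'content': '# Title\nIntro'}
# {'header': 'Part', 'content': '## Part\nBody'}
```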
SeparatorTextSplitter
SeparatorTextSplitter splits text recursively using an ordered list of separators. Use it for custom document structures where you need precise control over split points, or for hierarchically organized text.
Example:
from mistralai.search.toolkit.ingestion.text_splitters import SeparatorTextSplitter, SeparatorTextSplitterConfig
config = SeparatorTextSplitterConfig(
    chunk_size=1000,  # Target size (characters)
    chunk_max_size=1500,  # Maximum size (characters)
    chunk_overlap=200,  # Overlap (characters)
    chunk_separators=[  # Tried in order
        "\n\n",  # Paragraphs first
        "\n",  # Then lines
        ". ",  # Then sentences
        " ",  # Then words
        "",  # Finally characters
    ],
)
splitter = SeparatorTextSplitter(config=config)
chunks = splitter.split_document(document)
Features:
- Tries separators in order
- Respects max size even if splitting fails
- Merges small chunks when possible
- Returns start/end offsets
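The first three behaviors listed above can be sketched in plain Python. This is a simplified illustration, not the SeparatorTextSplitter implementation: `recursive_split` is a hypothetical function, it uses a single size limit instead of separate target and maximum sizes, and it omits overlap and offsets:

```python
def recursive_split(text: str, separators: list[str], max_size: int) -> list[str]:
    """Split on the first separator; recurse with the next one for oversized pieces."""
    if len(text) <= max_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to a hard cut so max_size is respected
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_size:
            current = candidate  # Merge small pieces while they still fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, remaining, max_size))
    if current:
        chunks.append(current)
    return chunks


text = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(text, ["\n\n", " ", ""], max_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits as a whole, while the second exceeds the limit and is split again on spaces, with adjacent words merged back together up to the limit.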
Creating custom splitters
Implement the TextSplitter protocol:
from mistralai.search.toolkit.ingestion.text_splitters import TextSplitter
from mistralai.search.toolkit.document import DocumentChunk
class ParagraphSplitter(TextSplitter):
    """Split text into chunks by paragraphs."""
    def __init__(self, max_paragraphs_per_chunk: int = 3):
        self.max_paragraphs_per_chunk = max_paragraphs_per_chunk
    def split_text(self, text: str, **kwargs) -> list[DocumentChunk]:
        paragraphs = text.split("\n\n")
        chunks = []
        offset = 0  # Character position of the current chunk in the source text
        for i in range(0, len(paragraphs), self.max_paragraphs_per_chunk):
            chunk_paragraphs = paragraphs[i:i + self.max_paragraphs_per_chunk]
            content = "\n\n".join(chunk_paragraphs)
            chunks.append(DocumentChunk(
                content=content,
                start_offset=offset,
                end_offset=offset + len(content),
            ))
            offset += len(content) + 2  # Skip the "\n\n" separator between chunks
        return chunks
splitter = ParagraphSplitter(max_paragraphs_per_chunk=3)
chunks = splitter.split_text("First paragraph.\n\nSecond paragraph.\n\nThird paragraph.")
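Offset bookkeeping is the easiest part of a custom splitter to get wrong, so a useful sanity check is that each chunk's offsets slice back to its content. The sketch below is self-contained: `Chunk` is a hypothetical stand-in with the same three fields used in the example above, and `find_offset_mismatches` is an illustrative helper, not a toolkit function:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in exposing content, start_offset and end_offset."""
    content: str
    start_offset: int
    end_offset: int


def find_offset_mismatches(text: str, chunks: list) -> list[int]:
    """Return indexes of chunks whose offsets do not slice back to their content."""
    return [
        i for i, chunk in enumerate(chunks)
        if text[chunk.start_offset:chunk.end_offset] != chunk.content
    ]


text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
good = [Chunk("First paragraph.", 0, 16), Chunk("Second paragraph.", 18, 35)]
bad = [Chunk("Second paragraph.", 16, 33)]  # Forgot to skip the "\n\n" separator

print(find_offset_mismatches(text, good))  # []
print(find_offset_mismatches(text, bad))   # [0]
```

The same check should work on any objects that expose these three attributes.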