Document extractors convert the content of a File into a Document object that can be processed later. Different file types require different extraction logic: PDFs can be extracted through OCR, while Excel files require tabular extraction.

All built-in extractors set the document ID to file.id.

Available document extractors

Extractor                              File types
Mistral OCR Extractor                  PDF, DOCX, PPTX, ODT
Mistral Audio Transcription Extractor  MP3, WAV, M4A, FLAC, OGG
Plain Text Extractor                   TXT, MD, CSV, JS, PY (and other text/code files)
HTML Extractor                         HTML, HTM
Spreadsheet Extractor                  XLS, XLSX, XLSM, XLSB, ODS, ODF
Email Extractor                        EML, MSG
Numbers Extractor                      NUMBERS
Legacy Office Extractor                DOC, PPT, HWP, HWPX
Custom Extractors                      Any source
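
The mapping in the table above lends itself to a simple dispatch step. The sketch below resolves a filename to the name of the extractor class documented on this page. EXTRACTOR_EXTENSIONS and pick_extractor_name are hypothetical helpers, not part of the toolkit, and the plain-text entry covers only the extensions named in the table.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical lookup built from the table above (class name -> extensions).
EXTRACTOR_EXTENSIONS = {
    "MistralOCRExtractor": ["pdf", "docx", "pptx", "odt"],
    "MistralAudioTranscriptionExtractor": ["mp3", "wav", "m4a", "flac", "ogg"],
    "PlainTextExtractor": ["txt", "md", "csv", "js", "py"],
    "HTMLExtractor": ["html", "htm"],
    "SpreadsheetExtractor": ["xls", "xlsx", "xlsm", "xlsb", "ods", "odf"],
    "EmailExtractor": ["eml", "msg"],
    "NumbersExtractor": ["numbers"],
    "LegacyOfficeExtractor": ["doc", "ppt", "hwp", "hwpx"],
}

# Invert the mapping so a single dictionary lookup resolves an extension.
EXTENSION_TO_EXTRACTOR = {
    ext: name for name, exts in EXTRACTOR_EXTENSIONS.items() for ext in exts
}


def pick_extractor_name(filename: str) -> Optional[str]:
    """Return the extractor class name for a filename, or None if unmapped."""
    extension = PurePath(filename).suffix.lower().lstrip(".")
    return EXTENSION_TO_EXTRACTOR.get(extension)


print(pick_extractor_name("Report.PDF"))
print(pick_extractor_name("notes.md"))
print(pick_extractor_name("archive.zip"))
```

Unmapped extensions return None, which leaves room to fall back to a custom extractor.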

Mistral OCR Extractor

MistralOCRExtractor uses the Mistral OCR API to extract structured text and images from PDF, DOCX, PPTX, and ODT files. It is the recommended extractor for scanned and image-heavy documents.

Requirements:

  • A Mistral API key

Installation: Core library (no extra required)

Example:

import os

from mistralai.client import Mistral
from mistralai.search.toolkit.ingestion.extractors import MistralOCRExtractor

mistral_client = Mistral(
    api_key=os.environ.get("MISTRAL_API_KEY", "your-api-key"),
    server_url=os.environ.get("MISTRAL_SERVER_URL"),  # Optional
)
extractor = MistralOCRExtractor(
    client=mistral_client,
    include_image_base64=True,      # Include image data
    include_image_annotation=True,  # Add image annotations
)
document = await extractor.extract(file)

Parameters:

Parameter                 Type                        Default               Description
client                    Mistral                     (required)            Mistral client instance
model_name                str                         "mistral-ocr-latest"  OCR model name
timeout_seconds           int                         900                   Request timeout in seconds
strip_page_markdown       bool                        True                  Strip leading/trailing whitespace from page markdown
populate_content          bool                        True                  Populate the document content field
include_image_base64      bool                        False                 Include base64 image data in the document
include_image_annotation  bool                        False                 Add image annotations to page markdown
max_file_size_bytes       int | None                  None                  Maximum file size before splitting into parts
pages_split_size          int | None                  None                  Number of pages per split when file exceeds max_file_size_bytes
pages_group_size          int | None                  None                  Number of pages per API request
http_headers              Mapping[str, str] | None    None                  Custom HTTP headers forwarded to the OCR API
image_limit               int | None                  None                  Maximum number of images to include per page
pages                     Sequence[int] | None        None                  Specific page numbers to extract (1-based)
table_format              "markdown" | "html" | None  None                  Table output format

Features:

  • OCR for scanned documents and images
  • Markdown output with structure preservation
  • Image reference tracking and optional base64 encoding
  • Configurable page extraction (specific pages or page ranges)
  • Table format customization (markdown or HTML)
  • Large file handling with automatic splitting
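
As a rough illustration of page-based batching, the sketch below groups 1-based page numbers into fixed-size batches, which is the kind of grouping that pages_split_size and pages_group_size describe. The group_pages helper is hypothetical; how the extractor batches pages internally is not documented here.

```python
def group_pages(total_pages: int, group_size: int) -> list:
    """Split the 1-based page numbers 1..total_pages into consecutive groups."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    pages = list(range(1, total_pages + 1))
    # Step through the page list in strides of group_size.
    return [pages[i : i + group_size] for i in range(0, len(pages), group_size)]


print(group_pages(5, 2))  # [[1, 2], [3, 4], [5]]
```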

Mistral Audio Transcription Extractor

MistralAudioTranscriptionExtractor uses the Mistral audio transcription API to transcribe audio files into text. It supports speaker diarization and timestamp granularities.

Requirements:

  • A Mistral API key

Installation: Core library (no extra required)

Example:

import os

from mistralai.client import Mistral
from mistralai.search.toolkit.ingestion.extractors import MistralAudioTranscriptionExtractor

extractor = MistralAudioTranscriptionExtractor(
    client=Mistral(api_key=os.environ.get("MISTRAL_API_KEY", "your-api-key")),
    model_name="voxtral-mini-latest",  # Default model
    language="en",          # Optional language hint
    diarize=True,           # Enable speaker diarization
    timeout_seconds=900,    # Transcription timeout (default)
)
document = await extractor.extract(file)

Parameters:

Parameter                Type                      Default                Description
client                   Mistral                   (required)             Mistral client instance
model_name               str                       "voxtral-mini-latest"  Transcription model name
language                 str | None                None                   Target language hint for transcription
diarize                  bool                      False                  Enable speaker diarization
timestamp_granularities  list[str] | None          None                   Timestamp granularities (e.g. ["word", "segment"])
timeout_seconds          int                       900                    Request timeout in seconds
populate_content         bool                      True                   Populate the document content field
http_headers             Mapping[str, str] | None  None                   Custom HTTP headers forwarded to the transcription API

Features:

  • Audio transcription with speaker diarization support
  • Timestamp granularities for word and segment-level precision
  • Language hint support for improved accuracy
  • Support for raw file uploads and remote URL sources
  • Customizable timeout for long audio files

Plain Text Extractor

PlainTextExtractor handles plain-text documents such as TXT, Markdown, CSV, and source code files.

Installation: Core library (no extra required)

Example:

from mistralai.search.toolkit.ingestion.extractors import PlainTextExtractor

extractor = PlainTextExtractor(
    page_size=2000,    # Characters per page (default)
    encoding="utf-8",  # Default encoding
)
document = await extractor.extract(file)

Configuration options:

Option                   Type  Default  Purpose
page_size                int   2000     Characters per page (splits large files into logical pages)
encoding                 str   "utf-8"  Character encoding for file reading
skip_encoding_detection  bool  False    Skip encoding auto-detection (use specified encoding only)

Behavior:

  • Files longer than page_size characters are split into multiple pages
  • Each page becomes a document page with page_number metadata
  • The entire file content is preserved in document.content
  • Unless skip_encoding_detection is True, encoding detection tries UTF-8 first, then falls back to other encodings
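
The behavior above can be approximated in a few lines of standard-library Python. This is a sketch of the idea, not the extractor's implementation: decode_with_fallback and split_into_pages are hypothetical helpers, the fallback list ("utf-8", then "latin-1") is an assumption, and so is the 1-based page numbering.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


def split_into_pages(text: str, page_size: int = 2000) -> list:
    """Cut text into consecutive pages of at most page_size characters."""
    return [
        {"page_number": index + 1, "text": text[start : start + page_size]}
        for index, start in enumerate(range(0, len(text), page_size))
    ]


# Latin-1 bytes are not valid UTF-8 here, so the second encoding is used.
text = decode_with_fallback("café ".encode("latin-1") * 900)  # 4500 characters
pages = split_into_pages(text)
print([(page["page_number"], len(page["text"])) for page in pages])
```

Joining the page texts back together reproduces the full content, which mirrors how document.content keeps the entire file.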

HTML Extractor

HTMLExtractor parses HTML and HTM files, converting them to clean markdown for downstream chunking and retrieval. It strips boilerplate elements (navigation, footers, scripts) by default.

Installation:

For default markdown conversion:

uv add "mistralai-search-toolkit[html-converter-markdownify]"

If you provide a custom converter, the core library alone is sufficient.

Example:

from mistralai.search.toolkit.ingestion.extractors import HTMLExtractor

extractor = HTMLExtractor()
document = await extractor.extract(file)

The default converter is MarkdownifyConverter, backed by the markdownify library (MIT). It is tuned to produce clean markdown with ATX headings, consistent bullet styles, code fence language detection, and full subtree stripping for boilerplate tags.

Configuring the default converter:

MarkdownifyConverter accepts several options to customize its behavior:

from mistralai.search.toolkit.ingestion.extractors import HTMLExtractor, MarkdownifyConverter

# Use default settings
converter = MarkdownifyConverter()

# Or customize: only strip <script> and <style>, disable id/class filtering
converter = MarkdownifyConverter(
    ignore_tags=["script", "style"],
    ignore_ids=[],
    ignore_classes=[],
)

extractor = HTMLExtractor(converter=converter)
document = await extractor.extract(file)

MarkdownifyConverter options:

Option          Type              Default                 Description
ignore_tags     list[str] | None  DEFAULT_IGNORE_TAGS     HTML tags to strip entirely, including all their children. When None, uses the default list.
ignore_ids      list[str] | None  DEFAULT_IGNORE_IDS      Element id attribute values to strip (literal match). The entire subtree is removed. When None, uses the default list.
ignore_classes  list[str] | None  DEFAULT_IGNORE_CLASSES  Regex patterns matched against each CSS class on an element. When any class matches, the entire subtree is removed. When None, uses the default list.
escape_misc     bool              True                    Escape miscellaneous markdown characters (e.g. | in table cells).

Default ignored tags: head, header, script, style, title, footer, form, button, nav, iframe.

Default ignored IDs: footer, sidebar, cookie, metadata.

Default ignored class patterns: footer, ^ad-, ^ad_, ^menu$, ^newsletter$, ^metadata$, ^muted$, vot(e|ing).
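
To make the difference between literal ID matching and regex class matching concrete, the sketch below applies the default lists to a few sample values. The helper names are hypothetical, and using re.search (rather than re.match or re.fullmatch) is an assumption inferred from the mix of anchored and unanchored patterns.

```python
import re

# Patterns copied from the default ignored class patterns listed above.
DEFAULT_CLASS_PATTERNS = [
    "footer", "^ad-", "^ad_", "^menu$", "^newsletter$",
    "^metadata$", "^muted$", "vot(e|ing)",
]
# IDs are documented as literal matches.
DEFAULT_IDS = {"footer", "sidebar", "cookie", "metadata"}


def matches_ignored_class(css_classes: list) -> bool:
    """True when any class on the element matches any pattern."""
    return any(
        re.search(pattern, css_class)
        for css_class in css_classes
        for pattern in DEFAULT_CLASS_PATTERNS
    )


def matches_ignored_id(element_id: str) -> bool:
    """True only for an exact, literal ID match."""
    return element_id in DEFAULT_IDS


print(matches_ignored_class(["site-footer"]))  # unanchored "footer" matches
print(matches_ignored_class(["menu-item"]))    # "^menu$" requires the whole class
print(matches_ignored_id("sidebar-left"))      # literal comparison, no match
```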

Custom converter:

You can inject any object that implements the HtmlToMarkdownConverter protocol:

from mistralai.search.toolkit.ingestion.extractors import HTMLExtractor, HtmlToMarkdownConverter

class MyConverter(HtmlToMarkdownConverter):
    def convert(self, html: str) -> str:
        return html  # your conversion logic

extractor = HTMLExtractor(converter=MyConverter())
document = await extractor.extract(file)
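
Assuming the protocol only requires a convert(html: str) -> str method, as the MyConverter example suggests, a dependency-free converter could look like the sketch below. MinimalConverter and _BlockCollector are hypothetical names. The class is duck-typed rather than subclassing HtmlToMarkdownConverter so that the snippet runs without the toolkit installed, and it handles only h1 to h3 headings and paragraphs.

```python
from html.parser import HTMLParser


class _BlockCollector(HTMLParser):
    """Collects heading and paragraph text, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}
    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self) -> None:
        super().__init__()
        self.blocks = []        # finished markdown blocks
        self._buffer = []       # text fragments of the current block
        self._prefix = ""       # ATX heading marker for the current block
        self._skip_depth = 0    # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()
            self._prefix = self.HEADINGS[tag] + " "
        elif tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS or tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def _flush(self):
        # Collapse whitespace and emit the block if it has any text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(self._prefix + text)
        self._buffer = []
        self._prefix = ""

    def close(self):
        super().close()
        self._flush()


class MinimalConverter:
    """Duck-typed converter exposing the convert(html) -> str method."""

    def convert(self, html: str) -> str:
        collector = _BlockCollector()
        collector.feed(html)
        collector.close()
        return "\n\n".join(collector.blocks)


html = "<h1>Title</h1><script>x = 1</script><p>Hello <b>world</b></p>"
print(MinimalConverter().convert(html))
```

An instance of such a class would then be passed as HTMLExtractor(converter=...), in the same way as MyConverter above.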

Parameters:

Parameter      Type                            Default   Description
encoding       str                             "utf-8"   HTML file encoding
decode_errors  str                             "strict"  Error handling for decoding ("strict", "ignore", "replace")
converter      HtmlToMarkdownConverter | None  None      Custom converter instance; when None, the default MarkdownifyConverter is used

Features:

  • Converts HTML to clean markdown with customizable stripping
  • Preserves structure (headings, lists, tables)
  • Boilerplate removal (navigation, footers, ads, etc.)
  • Custom converter support for alternative HTML-to-markdown implementations

Spreadsheet Extractor

SpreadsheetExtractor handles spreadsheet documents (XLS, XLSX, XLSM, XLSB, ODS, ODF) and yields one CSV block per sheet.

Installation:

uv add "mistralai-search-toolkit[extractor-spreadsheet]"

Example:

from mistralai.search.toolkit.ingestion.extractors import SpreadsheetExtractor

extractor = SpreadsheetExtractor(
    include_sheet_name=True,  # Include sheet name in output
    row_limit=None,           # Optional row limit per sheet
    col_limit=None,           # Optional column limit per sheet
)
document = await extractor.extract(file)

Configuration options:

Option                   Type        Default  Purpose
include_sheet_name       bool        True     Prepend sheet name to each block of CSV
row_limit                int | None  None     Maximum rows per sheet (None = no limit)
col_limit                int | None  None     Maximum columns per sheet (None = no limit)
skip_empty_sheets        bool        True     Skip sheets with no data
preserve_formula_values  bool        True     Use formula results, not formula text

Behavior:

  • Each sheet becomes a separate CSV block in the document
  • Merged cells are expanded with values repeated
  • Formulas are evaluated (unless preserve_formula_values=False)
  • Headers are inferred from first row
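
The per-sheet CSV output and the row and column caps can be pictured with the standard csv module. This is only a sketch: sheet_to_csv_block is a hypothetical helper, the "Sheet: ..." prefix line is an invented format, and counting the header row toward row_limit is an assumption.

```python
import csv
import io
from typing import Optional


def sheet_to_csv_block(
    sheet_name: str,
    rows: list,
    include_sheet_name: bool = True,
    row_limit: Optional[int] = None,
    col_limit: Optional[int] = None,
) -> str:
    """Render one sheet as a CSV block, applying optional row/column caps."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Slicing with None keeps everything, so None means "no limit".
    for row in rows[:row_limit]:
        writer.writerow(row[:col_limit])
    header = f"Sheet: {sheet_name}\n" if include_sheet_name else ""
    return header + buffer.getvalue()


rows = [["region", "q1", "q2"], ["EU", 10, 12], ["US", 7, 9]]
print(sheet_to_csv_block("Sales", rows, row_limit=2, col_limit=2))
```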

Email Extractor

EmailExtractor converts email files (.eml and .msg) into a single markdown-backed document with subject, sender, recipients, date, and body. HTML bodies are converted to markdown when available.

Installation:

uv add "mistralai-search-toolkit[extractor-email]"

This extra adds the eml-parser and extract-msg dependencies.

Example:

from mistralai.search.toolkit.ingestion.extractors import EmailExtractor

extractor = EmailExtractor()
document = await extractor.extract(file)

Features:

  • Parses EML (RFC 822) and Outlook MSG formats
  • Outputs structured markdown (subject, from, to, CC, date, body)
  • Prefers plain-text body; falls back to HTML converted to markdown
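
The plain-text-first preference can be illustrated with the Python standard library alone. This sketch is not the EmailExtractor implementation, which relies on eml-parser and extract-msg. summarize_email is a hypothetical helper that uses EmailMessage.get_body with a preference list to pick the plain-text part before the HTML part.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_email(raw: bytes) -> dict:
    """Parse raw EML bytes and return headers plus the preferred body part."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Ask for the plain-text body first and accept HTML only as a fallback.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["subject"] or ""),
        "from": str(msg["from"] or ""),
        "to": str(msg["to"] or ""),
        "body": body_part.get_content().strip() if body_part is not None else "",
        "body_type": body_part.get_content_type() if body_part is not None else "",
    }


# Build a small multipart/alternative message to exercise the helper.
sample = EmailMessage()
sample["Subject"] = "Quarterly report"
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample.set_content("See the attached figures.")
sample.add_alternative("<p>See the attached figures.</p>", subtype="html")

summary = summarize_email(sample.as_bytes())
print(summary["subject"], "|", summary["body_type"], "|", summary["body"])
```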

Attachments: Use extract_email_attachments (or extract_eml_attachments / extract_msg_attachments) to process attachments from the same email file. They return a list of EmailAttachment (filename, content_type, data, extension) that you can pass to other extractors:

from mistralai.search.toolkit.ingestion.extractors import (
    EmailExtractor,
    EmailAttachment,
    extract_email_attachments,
)

document = await EmailExtractor().extract(file)
attachments = extract_email_attachments(file.raw, extension="eml")
for att in attachments:
    # Build a File from att.data and run through the right extractor
    ...

Numbers Extractor

NumbersExtractor handles Apple Numbers documents (.numbers) and yields one CSV block per table or sheet.

Installation:

uv add "mistralai-search-toolkit[extractor-spreadsheet]"

Example:

from mistralai.search.toolkit.ingestion.extractors import NumbersExtractor, NumbersOptions

extractor = NumbersExtractor(
    row_limit=1000,           # Cap rows per table
    col_limit=50,             # Cap columns per table
    include_sheet_name=True,  # Prefix each block with sheet/table name
)
document = await extractor.extract(file)

Use iter_csv_pages(blob, options=...) to iterate over CSV strings without building a full Document.

Legacy Office Extractor

LegacyOfficeExtractor handles legacy Office formats (.doc, .ppt, .hwp, .hwpx) by converting them to PDF with PyMuPDF Pro, then running Mistral OCR. It does not handle .xls; use SpreadsheetExtractor for that.

Installation:

uv add "mistralai-search-toolkit[extractor-pymupdf]"

A PyMuPDF Pro license is required for full format support.

Example:

from mistralai.client import Mistral
from mistralai.search.toolkit.ingestion.extractors import LegacyOfficeExtractor, MistralOCRExtractor

ocr_extractor = MistralOCRExtractor(
    client=Mistral(api_key="your-api-key")
)
extractor = LegacyOfficeExtractor(
    ocr_extractor,
    pymupdf_license_key="your-license-key",  # Optional
)
document = await extractor.extract(
    file,
    include_image_base64=True,
    include_image_annotation=True,
)

Features:

  • Converts the legacy document to PDF in memory, then delegates to MistralOCRExtractor
  • Supports the same OCR options as Mistral OCR (e.g. http_headers, image_limit)

Creating custom extractors

Implement the DocumentExtractor protocol:

import base64
from mistralai.search.toolkit.ingestion.extractors import DocumentExtractor
from mistralai.search.toolkit.document import Document, Page
from mistralai.search.toolkit.ingestion import File

class Base64FileExtractor(DocumentExtractor):
    """Extract content from base64-encoded text files."""

    async def extract(self, file: File, **kwargs) -> Document:
        decoded_content = base64.b64decode(file.raw).decode("utf-8")

        return Document(
            id=file.id,
            extractor_type="text",
            content=decoded_content,
            pages=[Page(page_number=1, markdown=decoded_content, ref_to_images={})],
            metadata={"encoding": "base64"},
            filename=file.name,
            filepath=file.path,
        )

extractor = Base64FileExtractor()
document = await extractor.extract(b64_file)
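
The snippets on this page call await at the top level, which only works inside an async context. The sketch below shows one way to drive an extractor from a regular script with asyncio.run and to process several files concurrently. StubFile, StubDocument, Base64StubExtractor, and extract_all are hypothetical stand-ins so that the example runs without the toolkit; they are not the library's File and Document classes.

```python
import asyncio
import base64
from dataclasses import dataclass, field


@dataclass
class StubFile:
    """Stand-in for the toolkit's File (only the fields used here)."""
    id: str
    name: str
    raw: bytes


@dataclass
class StubDocument:
    """Stand-in for the toolkit's Document."""
    id: str
    content: str
    metadata: dict = field(default_factory=dict)


class Base64StubExtractor:
    """Same decoding logic as Base64FileExtractor, using the stand-in types."""

    async def extract(self, file: StubFile, **kwargs) -> StubDocument:
        decoded = base64.b64decode(file.raw).decode("utf-8")
        return StubDocument(id=file.id, content=decoded, metadata={"encoding": "base64"})


async def extract_all(extractor, files):
    """Run extract() for every file concurrently, preserving input order."""
    return await asyncio.gather(*(extractor.extract(f) for f in files))


files = [
    StubFile(id="f1", name="a.b64", raw=base64.b64encode(b"hello")),
    StubFile(id="f2", name="b.b64", raw=base64.b64encode(b"world")),
]
documents = asyncio.run(extract_all(Base64StubExtractor(), files))
print([(doc.id, doc.content) for doc in documents])
```

As in the built-in extractors, the document ID is taken from the file ID, so each result can be traced back to its source file.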