File loaders | Mistral Docs

File loaders load files from various sources into File objects that can be processed by document extractors.

Available file loaders

Loader	Source
Filesystem File Loader	Local filesystem
AWS S3 File Loader	AWS S3 (and S3-compatible: MinIO, Ceph)
Google Cloud Storage Loader	Google Cloud Storage
Azure Blob Storage Loader	Azure Blob Storage
Custom Loaders	Any source

Filesystem File Loader

FilesystemFileLoader loads files from the local filesystem. It accepts a root parameter that restricts access — any path resolving outside the root is rejected, preventing path traversal.

Installation: Core library (no extra required)

Example:

from pathlib import Path
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader

loader = FilesystemFileLoader(root="/data/documents")
file = await loader.load_file(Path("report.pdf"))

Limiting file size:

Use max_file_size to reject files that exceed a given size in bytes. The check is performed via a metadata lookup before loading, so oversized files are never read into memory:

from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader

loader = FilesystemFileLoader(
    root="/data/documents",
    max_file_size=50 * 1024 * 1024,  # 50 MiB
)
file = await loader.load_file("report.pdf")

A FileSizeLimitExceededException is raised when a file exceeds the limit.

Parameters:

Parameter	Type	Default	Description
`root`	`Path | str`	`"/"`	Root directory; paths resolving outside this directory are rejected
`max_file_size`	`int | None`	`None`	Maximum file size in bytes. Files exceeding this limit are rejected before loading. `None` means no limit.

Security:

The root parameter prevents path traversal attacks. Any path resolving outside the root directory is rejected before attempting to open the file.

AWS S3 File Loader

Load files from AWS S3 buckets (or S3-compatible services like MinIO, Ceph).

Installation:

uv add "mistralai-search-toolkit-storage-s3"

Example:

from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage

loader = FileLoader(
    lambda: S3BlobStorage(
        bucket_name="my-bucket",
        region_name="us-east-1",
        endpoint_url="http://localhost:9000",  # optional for MinIO / other S3-compatible backends
    )
)

file = await loader.load_file("reports/example.pdf")

Parameters (S3BlobStorage):

Parameter	Type	Default	Description
`bucket_name`	`str`	(required)	S3 bucket name
`region_name`	`str`	`"us-east-1"`	AWS region
`endpoint_url`	`str | None`	`None`	Custom endpoint URL (for MinIO, Ceph, or other S3-compatible services)

Authentication:

Uses the default AWS credentials chain:

IAM role (recommended for production)
Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
AWS credentials file: ~/.aws/credentials

Azure Blob Storage Loader

Load files from Azure Blob Storage.

Installation:

uv add "mistralai-search-toolkit-storage-azure"

Example:

from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.azure import AzureBlobStorage

loader = FileLoader(
    lambda: AzureBlobStorage(
        container_name="my-container",
        account_url="https://myaccount.blob.core.windows.net",
        use_workload_identity=True,  # or pass azure_connection_string=... for connection-string auth
    )
)

file = await loader.load_file("reports/example.pdf")

Parameters (AzureBlobStorage):

Parameter	Type	Default	Description
`container_name`	`str`	(required)	Azure Blob Storage container name
`account_url`	`str`	(required)	Azure storage account URL (e.g., `https://myaccount.blob.core.windows.net`)
`use_workload_identity`	`bool`	`False`	Use workload identity for authentication (recommended for Azure VMs/Functions)
`azure_connection_string`	`str | None`	`None`	Connection string for authentication (alternative to workload identity)

Authentication:

Workload Identity (recommended for Azure VMs/Functions)
Connection string: Pass it via the azure_connection_string parameter or set the AZURE_STORAGE_CONNECTION_STRING environment variable
SAS token: Time-limited access (configure via account credentials)

Google Cloud Storage Loader

Load files from Google Cloud Storage (GCS).

Installation:

uv add "mistralai-search-toolkit-storage-gcs"

Example:

from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.gcs import GCSBlobStorage

loader = FileLoader(
    lambda: GCSBlobStorage(
        bucket_name="my-bucket",
        project="my-gcp-project",
        service_account_file="/path/to/service-account.json",  # optional; falls back to ADC
    )
)

file = await loader.load_file("reports/example.pdf")

Parameters (GCSBlobStorage):

Parameter	Type	Default	Description
`bucket_name`	`str`	(required)	GCS bucket name
`project`	`str`	(required)	GCP project ID
`service_account_file`	`str | None`	`None`	Path to service account JSON file. If not provided, uses Application Default Credentials (ADC)

Authentication:

Application Default Credentials (ADC) (recommended): uses credentials from the environment, the metadata service, or the gcloud CLI
Service account file: Pass it via the service_account_file parameter or set the GOOGLE_APPLICATION_CREDENTIALS environment variable

Batch Loading

Load multiple files with concurrency control:

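import asyncio
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage  # used in the usage example below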

async def load_files_batch(
    loader: FileLoader,
    paths: list[str],
    max_concurrent: int = 10,
) -> list:
    """Load multiple files concurrently with semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def load_with_semaphore(path: str):
        async with semaphore:
            try:
                return await loader.load_file(path)
            except Exception as e:
                print(f"Failed to load {path}: {e}")
                return None

    results = await asyncio.gather(
        *[load_with_semaphore(p) for p in paths],
        return_exceptions=False,
    )
    return [f for f in results if f is not None]

# Usage
loader = FileLoader(lambda: S3BlobStorage(bucket_name="my-bucket"))
files = await load_files_batch(
    loader,
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    max_concurrent=5,
)
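
The helper above prints failures and drops them from the result. If the caller needs to know which paths failed and why, one possible variant is to let `asyncio.gather` return exceptions as values and pair them with their paths. The sketch below is illustrative: `load_files_with_report` and `FakeLoader` are hypothetical names, and `FakeLoader` is an in-memory stand-in so the example runs without any storage backend. Any object with an async `load_file(path)` method would work in its place.

```python
import asyncio


async def load_files_with_report(loader, paths, max_concurrent=10):
    """Return (loaded, failed): path -> file and path -> exception."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def load_one(path):
        # At most max_concurrent loads hold the semaphore at any time.
        async with semaphore:
            return await loader.load_file(path)

    # return_exceptions=True returns raised exceptions as values,
    # in the same order as the input paths.
    results = await asyncio.gather(
        *(load_one(p) for p in paths),
        return_exceptions=True,
    )

    loaded, failed = {}, {}
    for path, result in zip(paths, results):
        if isinstance(result, BaseException):
            failed[path] = result
        else:
            loaded[path] = result
    return loaded, failed


class FakeLoader:
    """Hypothetical in-memory loader, used only to make this sketch runnable."""

    def __init__(self, blobs):
        self._blobs = blobs

    async def load_file(self, path):
        await asyncio.sleep(0)  # yield to the event loop like real I/O would
        if path not in self._blobs:
            raise FileNotFoundError(path)
        return self._blobs[path]


if __name__ == "__main__":
    fake = FakeLoader({"doc1.pdf": b"one", "doc2.pdf": b"two"})
    ok, errors = asyncio.run(
        load_files_with_report(fake, ["doc1.pdf", "missing.pdf", "doc2.pdf"], 2)
    )
    print(sorted(ok), {p: type(e).__name__ for p, e in errors.items()})
```

Returning the failures instead of printing them leaves the retry or logging policy to the caller.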

Custom Loaders

Implement the FileLoader protocol for custom sources not covered above:

from pathlib import Path
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion import File

class CustomFileLoader(FileLoader):
    """Load files from a custom source."""

    async def load_file(self, path: Path | str) -> File:
        # Implement your custom loading logic
        content = await self._fetch_from_custom_source(str(path))
        filename = Path(str(path)).name

        return File(
            path=str(path),
            name=filename,
            raw=content,
        )

    async def _fetch_from_custom_source(self, path: str) -> bytes:
        # Your implementation here
        ...
