File loaders load files from various sources into File objects that can be processed by document extractors.

Available file loaders

Loader                       Source
Filesystem File Loader       Local filesystem
AWS S3 File Loader           AWS S3 (and S3-compatible: MinIO, Ceph)
Google Cloud Storage Loader  Google Cloud Storage
Azure Blob Storage Loader    Azure Blob Storage
Custom Loaders               Any source

Filesystem File Loader

FilesystemFileLoader loads files from the local filesystem. It accepts a root parameter that restricts access — any path resolving outside the root is rejected, preventing path traversal.

Installation: Core library (no extra required)

Example:

from pathlib import Path
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader

loader = FilesystemFileLoader(root="/data/documents")
file = await loader.load_file(Path("report.pdf"))

Limiting file size:

Use max_file_size to reject files that exceed a given size in bytes. The check is performed via a metadata lookup before loading, so oversized files are never read into memory:

from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader

loader = FilesystemFileLoader(
    root="/data/documents",
    max_file_size=50 * 1024 * 1024,  # 50 MiB
)
file = await loader.load_file("report.pdf")

A FileSizeLimitExceededException is raised when a file exceeds the limit.
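
The "metadata lookup before loading" idea can be illustrated with the standard library alone. This is a minimal sketch, not the toolkit's implementation: read_with_size_limit and FileTooLargeError are hypothetical names standing in for the loader's internal check and for FileSizeLimitExceededException.

```python
from pathlib import Path
from typing import Optional


class FileTooLargeError(Exception):
    """Hypothetical stand-in for the toolkit's FileSizeLimitExceededException."""


def read_with_size_limit(path: Path, max_file_size: Optional[int]) -> bytes:
    """Read a file only if its on-disk size is within the limit."""
    if max_file_size is not None:
        # stat() reads filesystem metadata only; the contents are not read.
        size = path.stat().st_size
        if size > max_file_size:
            raise FileTooLargeError(
                f"{path} is {size} bytes (limit {max_file_size})"
            )
    # Reached only when there is no limit or the file fits within it.
    return path.read_bytes()
```

Because the size comes from stat(), an oversized file costs one metadata call rather than a full read.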

Parameters:

Parameter      Type        Default  Description
root           Path | str  "/"      Root directory; paths resolving outside this directory are rejected
max_file_size  int | None  None     Maximum file size in bytes. Files exceeding this limit are rejected before loading. None means no limit.

Security:

The root parameter guards against path traversal attacks: any path that resolves outside the root directory is rejected before the file is opened.
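
A root-containment check of this kind is commonly written with pathlib. The sketch below is an assumption about the general technique, not the loader's actual code; resolve_under_root is a hypothetical helper, and PermissionError is used only for illustration.

```python
from pathlib import Path
from typing import Union


def resolve_under_root(root: Union[Path, str], path: Union[Path, str]) -> Path:
    """Resolve `path` against `root`, refusing anything that escapes `root`."""
    root_resolved = Path(root).resolve()
    # Joining with an absolute path discards the root, so "/etc/passwd" is
    # caught by the containment check below just like "../" sequences.
    # resolve() also follows symlinks, so a link pointing outside is caught.
    candidate = (root_resolved / path).resolve()
    if not candidate.is_relative_to(root_resolved):  # Python 3.9+
        raise PermissionError(f"{path!s} resolves outside {root_resolved}")
    return candidate
```

The check runs on the resolved path and never opens the file, so a rejected path is never read.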

AWS S3 File Loader

AWS S3 File Loader

Load files from AWS S3 buckets (or S3-compatible services like MinIO, Ceph).

Installation:

uv add "mistralai-search-toolkit-storage-s3"

Example:

from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage

loader = FileLoader(
    lambda: S3BlobStorage(
        bucket_name="my-bucket",
        region_name="us-east-1",
        endpoint_url="http://localhost:9000",  # optional for MinIO / other S3-compatible backends
    )
)

file = await loader.load_file("reports/example.pdf")

Parameters (S3BlobStorage):

Parameter     Type        Default      Description
bucket_name   str         (required)   S3 bucket name
region_name   str         "us-east-1"  AWS region
endpoint_url  str | None  None         Custom endpoint URL (for MinIO, Ceph, or other S3-compatible services)

Authentication:

Uses the default AWS credentials chain:

  • IAM role (recommended for production)
  • Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
  • AWS credentials file: ~/.aws/credentials

Azure Blob Storage Loader

Load files from Azure Blob Storage.

Installation:

uv add "mistralai-search-toolkit-storage-azure"

Example:

from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.azure import AzureBlobStorage

loader = FileLoader(
    lambda: AzureBlobStorage(
        container_name="my-container",
        account_url="https://myaccount.blob.core.windows.net",
        use_workload_identity=True,  # or pass azure_connection_string=... for connection-string auth
    )
)

file = await loader.load_file("reports/example.pdf")

Parameters (AzureBlobStorage):

Parameter                Type        Default     Description
container_name           str         (required)  Azure Blob Storage container name
account_url              str         (required)  Azure storage account URL (e.g., https://myaccount.blob.core.windows.net)
use_workload_identity    bool        False       Use workload identity for authentication (recommended for Azure VMs/Functions)
azure_connection_string  str | None  None        Connection string for authentication (alternative to workload identity)

Authentication:

  • Workload Identity (recommended for Azure VMs/Functions)
  • Connection string: pass it via the azure_connection_string parameter or set the AZURE_STORAGE_CONNECTION_STRING environment variable
  • SAS token: Time-limited access (configure via account credentials)

Google Cloud Storage Loader

Load files from Google Cloud Storage (GCS).

Installation:

uv add "mistralai-search-toolkit-storage-gcs"

Example:

from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.gcs import GCSBlobStorage

loader = FileLoader(
    lambda: GCSBlobStorage(
        bucket_name="my-bucket",
        project="my-gcp-project",
        service_account_file="/path/to/service-account.json",  # optional; falls back to ADC
    )
)

file = await loader.load_file("reports/example.pdf")

Parameters (GCSBlobStorage):

Parameter             Type        Default     Description
bucket_name           str         (required)  GCS bucket name
project               str         (required)  GCP project ID
service_account_file  str | None  None        Path to service account JSON file. If not provided, uses Application Default Credentials (ADC)

Authentication:

  • Application Default Credentials (ADC) (recommended): uses credentials from the environment, the metadata service, or the gcloud CLI
  • Service account file: pass it via the service_account_file parameter or set the GOOGLE_APPLICATION_CREDENTIALS environment variable

Batch Loading

Load multiple files with concurrency control:

import asyncio
from mistralai.search.toolkit.ingestion.loaders import FileLoader

async def load_files_batch(
    loader: FileLoader,
    paths: list[str],
    max_concurrent: int = 10,
) -> list:
    """Load multiple files concurrently with semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def load_with_semaphore(path: str):
        async with semaphore:
            try:
                return await loader.load_file(path)
            except Exception as e:
                print(f"Failed to load {path}: {e}")
                return None

    results = await asyncio.gather(
        *[load_with_semaphore(p) for p in paths],
        return_exceptions=False,
    )
    return [f for f in results if f is not None]

# Usage
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage

loader = FileLoader(lambda: S3BlobStorage(bucket_name="my-bucket"))
files = await load_files_batch(
    loader,
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    max_concurrent=5,
)
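
The helper above prints failures and drops them, so the caller cannot tell which paths failed. One possible variant, sketched under the assumption that a loader is any object with an async load_file method, returns failures alongside successes. SupportsLoadFile and load_files_with_errors are hypothetical names, not part of the toolkit.

```python
import asyncio
from typing import Any, Protocol


class SupportsLoadFile(Protocol):
    """Anything with an async load_file method (assumed shape of a loader)."""

    async def load_file(self, path: str) -> Any: ...


async def load_files_with_errors(
    loader: SupportsLoadFile,
    paths: list[str],
    max_concurrent: int = 10,
) -> tuple[dict[str, Any], dict[str, BaseException]]:
    """Load paths concurrently; return (loaded, failed), both keyed by path."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def load_one(path: str) -> Any:
        # At most max_concurrent loads hold the semaphore at any time.
        async with semaphore:
            return await loader.load_file(path)

    # return_exceptions=True delivers each failure as a value in `results`
    # instead of raising the first one out of gather().
    results = await asyncio.gather(
        *(load_one(p) for p in paths), return_exceptions=True
    )

    loaded: dict[str, Any] = {}
    failed: dict[str, BaseException] = {}
    # gather() preserves input order, so results line up with paths.
    for path, result in zip(paths, results):
        if isinstance(result, BaseException):
            failed[path] = result
        else:
            loaded[path] = result
    return loaded, failed
```

Keying both dictionaries by path lets the caller retry or report exactly the paths that failed.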

Custom Loaders

Implement the FileLoader protocol for custom sources not covered above:

from pathlib import Path
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion import File

class CustomFileLoader(FileLoader):
    """Load files from a custom source."""

    async def load_file(self, path: Path | str) -> File:
        # Implement your custom loading logic
        content = await self._fetch_from_custom_source(str(path))
        filename = Path(str(path)).name

        return File(
            path=str(path),
            name=filename,
            raw=content,
        )

    async def _fetch_from_custom_source(self, path: str) -> bytes:
        # Your implementation here
        ...
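
As a concrete illustration of the same shape, here is a self-contained sketch of a loader that serves files from a dictionary, which can be handy in unit tests. It does not import the toolkit: the File dataclass below is a stand-in carrying only the three fields used above (the real File may have more), and InMemoryFileLoader is a hypothetical name.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Union


@dataclass
class File:
    """Stand-in with the three fields used above; the toolkit's File may differ."""

    path: str
    name: str
    raw: bytes


class InMemoryFileLoader:
    """Hypothetical loader that serves files from a dict of path -> bytes."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = blobs

    async def load_file(self, path: Union[PurePosixPath, str]) -> File:
        key = str(path)
        if key not in self._blobs:
            # Mirror filesystem behaviour for unknown paths.
            raise FileNotFoundError(key)
        # The file name is the last component of the slash-separated key.
        return File(path=key, name=PurePosixPath(key).name, raw=self._blobs[key])
```

A real custom loader would replace the dictionary lookup with a call to its own source, keeping the same load_file signature.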