File loaders load files from various sources into File objects that can be processed by document extractors.
Available file loaders
| Loader | Source |
|---|---|
| Filesystem File Loader | Local filesystem |
| AWS S3 File Loader | AWS S3 (and S3-compatible: MinIO, Ceph) |
| Google Cloud Storage Loader | Google Cloud Storage |
| Azure Blob Storage Loader | Azure Blob Storage |
| Custom Loaders | Any source |
Filesystem File Loader
FilesystemFileLoader loads files from the local filesystem. It accepts a root parameter that restricts access — any path resolving outside the root is rejected, preventing path traversal.
Installation: Core library (no extra required)
Example:
from pathlib import Path
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
loader = FilesystemFileLoader(root="/data/documents")
file = await loader.load_file(Path("report.pdf"))

Limiting file size:
Use max_file_size to reject files that exceed a given size in bytes. The check is performed via a metadata lookup before loading, so oversized files are never read into memory:
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
loader = FilesystemFileLoader(
    root="/data/documents",
    max_file_size=50 * 1024 * 1024,  # 50 MiB
)
file = await loader.load_file("report.pdf")

A FileSizeLimitExceededException is raised when a file exceeds the limit.
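The "metadata lookup before loading" behaviour can be illustrated with a small standard-library sketch. This is not the toolkit's implementation: `read_with_limit` and `FileTooLargeError` are hypothetical names used only for illustration, and the sketch assumes a plain `os.stat` call is an adequate size check for local files.

```python
import os
from pathlib import Path
from typing import Optional


class FileTooLargeError(Exception):
    """Hypothetical stand-in for the toolkit's FileSizeLimitExceededException."""


def read_with_limit(path: Path, max_file_size: Optional[int]) -> bytes:
    """Read a file only if its on-disk size is within the limit."""
    if max_file_size is not None:
        # Metadata lookup: asks the filesystem for the size without reading content.
        size = os.stat(path).st_size
        if size > max_file_size:
            raise FileTooLargeError(
                f"{path} is {size} bytes, limit is {max_file_size}"
            )
    # Only reached when the file is within the limit (or no limit is set).
    return path.read_bytes()
```

Because the size comes from file metadata, an oversized file is rejected without any of its bytes being read into memory, which matches the behaviour described for `max_file_size`.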
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| root | Path \| str | "/" | Root directory; paths resolving outside this directory are rejected |
| max_file_size | int \| None | None | Maximum file size in bytes. Files exceeding this limit are rejected before loading. None means no limit. |
Security:
The root parameter prevents path traversal attacks. Any path resolving outside the root directory is rejected before attempting to open the file.
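A minimal sketch of how a root-containment check of this kind can work, using only `pathlib`. It is an illustration under assumptions, not the code inside FilesystemFileLoader: the helper name `resolve_under_root` and the choice of `ValueError` are hypothetical.

```python
from pathlib import Path
from typing import Union


def resolve_under_root(root: Union[Path, str], requested: Union[Path, str]) -> Path:
    """Return the resolved path if it stays inside root, else raise ValueError."""
    root_path = Path(root).resolve()
    # Joining first means an absolute `requested` replaces the root entirely,
    # so it is caught by the containment check below. resolve() also collapses
    # ".." segments and follows symlinks.
    target = (root_path / requested).resolve()
    if not target.is_relative_to(root_path):  # requires Python 3.9+
        raise ValueError(f"{requested} resolves outside {root_path}")
    return target
```

Under these assumptions, `resolve_under_root("/data/documents", "report.pdf")` is accepted, while `"../secrets.txt"` and `"/etc/passwd"` are both rejected before any file is opened.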
AWS S3 File Loader
Load files from AWS S3 buckets (or S3-compatible services like MinIO, Ceph).
Installation:
uv add "mistralai-search-toolkit-storage-s3"

Example:
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage
loader = FileLoader(
    lambda: S3BlobStorage(
        bucket_name="my-bucket",
        region_name="us-east-1",
        endpoint_url="http://localhost:9000",  # optional; for MinIO and other S3-compatible backends
    )
)
file = await loader.load_file("reports/example.pdf")

Parameters (S3BlobStorage):
| Parameter | Type | Default | Description |
|---|---|---|---|
| bucket_name | str | (required) | S3 bucket name |
| region_name | str | "us-east-1" | AWS region |
| endpoint_url | str \| None | None | Custom endpoint URL (for MinIO, Ceph, or other S3-compatible services) |
Authentication:
Uses the default AWS credentials chain:
- IAM role (recommended for production)
- Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
- AWS credentials file: ~/.aws/credentials
Azure Blob Storage Loader
Load files from Azure Blob Storage.
Installation:
uv add "mistralai-search-toolkit-storage-azure"

Example:
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.azure import AzureBlobStorage
loader = FileLoader(
    lambda: AzureBlobStorage(
        container_name="my-container",
        account_url="https://myaccount.blob.core.windows.net",
        use_workload_identity=True,  # or pass azure_connection_string=... for connection-string auth
    )
)
file = await loader.load_file("reports/example.pdf")

Parameters (AzureBlobStorage):
| Parameter | Type | Default | Description |
|---|---|---|---|
| container_name | str | (required) | Azure Blob Storage container name |
| account_url | str | (required) | Azure storage account URL (e.g., https://myaccount.blob.core.windows.net) |
| use_workload_identity | bool | False | Use workload identity for authentication (recommended for Azure VMs/Functions) |
| azure_connection_string | str \| None | None | Connection string for authentication (alternative to workload identity) |
Authentication:
- Workload Identity (recommended for Azure VMs/Functions)
- Connection string: Pass via the azure_connection_string parameter or set the AZURE_STORAGE_CONNECTION_STRING environment variable
- SAS token: Time-limited access (configure via account credentials)
Google Cloud Storage Loader
Load files from Google Cloud Storage (GCS).
Installation:
uv add "mistralai-search-toolkit-storage-gcs"

Example:
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.plugins.storage.gcs import GCSBlobStorage
loader = FileLoader(
    lambda: GCSBlobStorage(
        bucket_name="my-bucket",
        project="my-gcp-project",
        service_account_file="/path/to/service-account.json",  # optional; falls back to ADC
    )
)
file = await loader.load_file("reports/example.pdf")

Parameters (GCSBlobStorage):
| Parameter | Type | Default | Description |
|---|---|---|---|
| bucket_name | str | (required) | GCS bucket name |
| project | str | (required) | GCP project ID |
| service_account_file | str \| None | None | Path to service account JSON file. If not provided, uses Application Default Credentials (ADC) |
Authentication:
- Application Default Credentials (ADC) (recommended) — uses credentials from environment, metadata service, or gcloud CLI
- Service account file: Pass via the service_account_file parameter or set the GOOGLE_APPLICATION_CREDENTIALS environment variable
Batch Loading
Load multiple files with concurrency control:
import asyncio
from mistralai.search.toolkit.ingestion.loaders import FileLoader

async def load_files_batch(
    loader: FileLoader,
    paths: list[str],
    max_concurrent: int = 10,
) -> list:
    """Load multiple files concurrently with semaphore."""
    # The semaphore caps how many load_file calls are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def load_with_semaphore(path: str):
        async with semaphore:
            try:
                return await loader.load_file(path)
            except Exception as e:
                # Failures are logged and skipped rather than aborting the batch.
                print(f"Failed to load {path}: {e}")
                return None

    results = await asyncio.gather(
        *[load_with_semaphore(p) for p in paths],
        return_exceptions=False,
    )
    # Drop the entries for files that failed to load.
    return [f for f in results if f is not None]

# Usage
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage

loader = FileLoader(lambda: S3BlobStorage(bucket_name="my-bucket"))
files = await load_files_batch(
    loader,
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    max_concurrent=5,
)

Custom Loaders
Implement the FileLoader protocol for custom sources not covered above:
from pathlib import Path
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion import File

class CustomFileLoader(FileLoader):
    """Load files from a custom source."""

    async def load_file(self, path: Path | str) -> File:
        # Implement your custom loading logic
        content = await self._fetch_from_custom_source(str(path))
        filename = Path(str(path)).name
        return File(
            path=str(path),
            name=filename,
            raw=content,
        )

    async def _fetch_from_custom_source(self, path: str) -> bytes:
        # Your implementation here
        ...
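
To make the shape of a custom loader concrete, here is a self-contained sketch of an in-memory loader, which can be handy as a test double. It is an illustration only: `InMemoryFileLoader` is a hypothetical name, the `File` dataclass is a stand-in that mirrors just the three fields used above (path, name, raw), and the class deliberately does not subclass the toolkit's FileLoader so that the snippet runs without the library installed.

```python
import asyncio
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass
class File:
    """Stand-in for the toolkit's File, limited to the fields used above."""
    path: str
    name: str
    raw: bytes


class InMemoryFileLoader:
    """Hypothetical loader that serves file contents from a dict."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = blobs

    async def load_file(self, path: Union[Path, str]) -> File:
        key = str(path)
        if key not in self._blobs:
            raise FileNotFoundError(key)
        # Same File construction as in the custom loader above.
        return File(path=key, name=Path(key).name, raw=self._blobs[key])


async def main() -> File:
    loader = InMemoryFileLoader({"reports/example.pdf": b"%PDF-1.7"})
    return await loader.load_file("reports/example.pdf")


if __name__ == "__main__":
    print(asyncio.run(main()).name)
```

In a real integration the class would follow the FileLoader contract shown above and return the toolkit's own File object; only the `load_file` signature and the File fields are taken from this page.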