Enterprise Document AI with Mistral: Healthcare Document Processing


Healthcare generates 30% of the world's data, yet much of it remains locked in PDFs, scanned faxes, and unstructured documents. As regulations like the CMS prior authorization mandate push toward digital-first operations and hospital staffing shortages intensify, automated document processing has become critical infrastructure, not just for patient intake but also for back-office operations like invoice management, medical billing and coding, and clinical documentation at scale.

Key challenges driving Document AI adoption:

  • 30% of global data originates in healthcare, mostly unstructured
  • Legacy systems rely on paper, fax, and non-digital formats
  • Regulatory pressure (CMS mandates, interoperability requirements)
  • Severe staffing shortages across clinical and administrative roles

Mistral OCR 3 handles intricate healthcare documents—handwritten notes, nested lab tables, checkboxes, and multi-page forms—with accuracy comparable to commercial solutions at a fraction of the cost. This cookbook demonstrates how to get started.

You can also explore Document AI interactively in AI Studio.

1. Setup

First, let's install mistralai and download the document.

%%capture
!pip install mistralai

Sample Document

This cookbook uses patient-packet-completed.pdf - a synthetic multi-page patient packet containing patient demographics, vital signs, and an X-ray image.

%%capture
!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/hcls/patient-packet-completed.pdf
# Verify sample document exists
import os

# Path to your pdf (using local file)
pdf_path = "patient-packet-completed.pdf"

if os.path.exists(pdf_path):
    print(f"✅ Found: {pdf_path}")
    print(f"   Size: {os.path.getsize(pdf_path) / 1024:.1f} KB")
else:
    print(f"❌ File not found: {pdf_path}")
    print("   Please ensure patient-packet-completed.pdf is in the working directory")

# List available sample files in the workspace
!ls -la *.pdf 2>/dev/null || echo "No PDF files found in current directory"

Create Client

Next, we need to set up our client. You can create an API key in AI Studio.

# Initialize Mistral client with API key
import os
from mistralai import Mistral
from google.colab import userdata

api_key = userdata.get('MISTRAL_API_KEY') # Replace with your way to retrieve API key

if not api_key:
    print("⚠️  WARNING: No API key found!")
else:
    client = Mistral(api_key=api_key)
    print("✅ Mistral client initialized")

2. Use Case: Patient Medical Record Packet OCR Processing

This section showcases Mistral OCR 3 capabilities using a 3-page patient packet, with each page highlighting a different feature:

| Page | Document Type | OCR Feature |
|------|---------------|-------------|
| 1 | Patient Admission Form | Form elements - checkboxes, handwriting, unified Unicode representation |
| 2 | Vital Signs Flowsheet | HTML table output - complex tables with rowspan/colspan |
| 3 | Foot X-ray | Image annotations - embedded images with descriptions |

> Note: Sample data is synthetic/anonymized.

2.1 Setup: Load and Process Document

First, let's encode the PDF and run OCR on the full packet. We'll then explore each page's output.

import base64

def encode_pdf(pdf_path):
    """Encode the pdf to base64."""
    try:
        with open(pdf_path, "rb") as pdf_file:
            return base64.b64encode(pdf_file.read()).decode('utf-8')
    except FileNotFoundError:
        print(f"Error: The file {pdf_path} was not found.")
        return None
    except Exception as e:
        print(f"Error: {e}")
        return None

Process the full document and get the OCR output:

import json

# Getting the base64 string
base64_pdf = encode_pdf(pdf_path)

# Call the OCR API
pdf_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{base64_pdf}"
    },
    include_image_base64=True,
    table_format="html" #Specify HTML format to render complex table formats
)

# Convert response to JSON format
response_dict = json.loads(pdf_response.model_dump_json())
print(json.dumps(response_dict, indent=4)[0:1000]) # check the first 1000 characters

Let's style the output for easier reading:

from IPython.display import display, HTML, Markdown
from mistralai.models import OCRResponse
from bs4 import BeautifulSoup

# CSS styling for tables (reusable constant)
TABLE_STYLE = """
<style>
    table {
        border-collapse: collapse;
        width: 100%;
        margin: 10px 0;
    }
    th, td {
        border: 1px solid black;
        padding: 8px;
        text-align: left;
    }
    th {
        background-color: #f2f2f2;
    }
</style>
"""

def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """Replace image placeholders in markdown with base64-encoded images."""
    for img_name, base64_str in images_dict.items():
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"<img src='{base64_str}' style='max-width:100%;'/>"
        )
    return markdown_str

def display_page_with_tables(page_index: int, ocr_data: dict, pdf_response: OCRResponse):
    """
    Display a page with styled HTML tables and embedded images.
    Tables are inserted inline at their original positions.
    Uses the parsed JSON dict for tables and the SDK response object for images.

    Args:
        page_index: Index of the page to display (0-based)
        ocr_data: Parsed OCR response (dict form)
        pdf_response: OCRResponse object from SDK
    """
    if page_index >= len(ocr_data["pages"]):
        print(f"Page {page_index} not found")
        return

    page = ocr_data["pages"][page_index]
    markdown = page["markdown"]

    # Replace table placeholders with styled HTML tables (preserves order)
    # This specifically handles the format [tbl-X.html](tbl-X.html) where X is the table index
    if "tables" in page and page["tables"]:
        for table in page["tables"]:
            table_id = table.get("id", "")
            if table_id:
                # Replace the exact placeholder format from the OCR output
                placeholder = f"[{table_id}]({table_id})"
                styled_table = TABLE_STYLE + table["content"]
                markdown = markdown.replace(placeholder, styled_table)

    # Replace image placeholders with base64 from pdf_response
    if page_index < len(pdf_response.pages):
        for img in pdf_response.pages[page_index].images:
            markdown = markdown.replace(
                f"![{img.id}]({img.id})",
                f"<img src='{img.image_base64}' style='max-width:100%;'/>"
            )

    # Display as HTML with whitespace preservation
    display(HTML(f"<div style='white-space: pre-wrap;'>{markdown}</div>"))

def display_all_pages(ocr_data: dict, pdf_response: OCRResponse, pages=None):
    """
    Display all pages with styled HTML tables and images.

    Args:
        ocr_data: Parsed OCR response (dict form)
        pdf_response: OCRResponse object from SDK
        pages: List of page indices to display (None for all pages)
    """
    if pages is None:
        pages = range(len(ocr_data["pages"]))

    for i in pages:
        # Print page separator with proper newline handling
        print(f"\n{'='*60}")
        print(f"📄 PAGE {ocr_data['pages'][i]['index'] + 1}")
        print('='*60)

        # Display the page content with proper whitespace preservation
        display_page_with_tables(i, ocr_data, pdf_response)

        # Add spacing between pages for better readability
        print("\n")

2.2 Form Elements: Checkboxes & Structured Fields (Page 1)

Page 1 contains a Patient Admission Form with checkboxes, handwriting, and fill-in lines. Mistral OCR 3 uses unified Unicode checkbox representation (☐ unchecked, ☑ checked) for consistent parsing.

# Display Page 1 - Form Elements
print("📄 PAGE 1: Patient Admission Form")
print("Notice: Checkboxes rendered as ☐ (unchecked) and ☑ (checked)\n")
display_page_with_tables(0, response_dict, pdf_response)

2.3 HTML Table Output: Vital Signs Flowsheet (Page 2)

Page 2 contains a Vital Signs Flowsheet with complex table structures. Mistral OCR 3 can output tables as HTML with proper rowspan and colspan attributes, preserving the original structure for accurate data extraction.

# Display Page 2 - Vital Signs Flowsheet with HTML table
print("📄 PAGE 2: Vital Signs Flowsheet")
print("Notice: Tables output as HTML with rowspan/colspan preserved\n")
display_page_with_tables(1, response_dict, pdf_response)
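
Beyond rendering, the HTML output can be parsed back into structured rows, which is what makes it useful for downstream extraction. Here is a minimal sketch using BeautifulSoup (imported earlier); it assumes the page dict exposes tables under page["tables"] with HTML in "content", the same fields display_page_with_tables relies on:

# Parse page 2's HTML tables into lists of cell values
page2_data = response_dict["pages"][1]
for table in page2_data.get("tables", []):
    soup = BeautifulSoup(table["content"], "html.parser")
    for row in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        print(cells)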

2.4 Image Annotations: X-ray (Page 3)

Page 3 contains an X-ray image. Mistral OCR 3 can detect, extract, and annotate images within documents. The image is embedded in the markdown output with base64 encoding.

# Display Page 3 - X-ray Image
print("📄 PAGE 3: Foot X-ray")
print("Notice: Images are detected and embedded with base64 encoding\n")

page3 = pdf_response.pages[2]

# Show image metadata
print(f"Images detected on this page: {len(page3.images)}")
for img in page3.images:
    print(f"  - Image ID: {img.id}")
    print(f"    Dimensions: ({img.top_left_x}, {img.top_left_y}) to ({img.bottom_right_x}, {img.bottom_right_y})")

print("\n" + "="*60)
print("Rendered Output (with embedded X-ray image):")
print("="*60)

# Display page with images and tables
display_page_with_tables(2, response_dict, pdf_response)
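
If you need the extracted image as a file rather than inline HTML, a small sketch follows. It assumes image_base64 is a data URL ("data:image/...;base64,<payload>"), matching how it is passed straight into an <img> tag above; the fallback filename is a guess since image IDs may already carry an extension.

import base64

for img in page3.images:
    payload = img.image_base64.split(",", 1)[-1]  # drop the data-URL prefix if present
    out_name = img.id if "." in img.id else f"{img.id}.png"  # hypothetical naming
    with open(out_name, "wb") as f:
        f.write(base64.b64decode(payload))
    print(f"Saved {out_name}")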

3. Document Intelligence Using Annotations

Moving from notebooks to production requires patterns for scale, reliability, and interoperability. This section demonstrates a progressive pipeline where we perform the following using DocAI Annotations:

  1. Classification → Identify document types before extraction
  2. Batch Processing → Handle multiple documents concurrently
  3. FHIR Generation → Transform extracted data into healthcare-standard formats

Note: Refer to this cookbook for an intro to Annotations.

Production Patterns Setup - imports and utilities

import base64
import json
import time
import uuid
import os
from datetime import datetime
from enum import Enum
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from collections import Counter

from pydantic import BaseModel, Field
from mistralai import Mistral
from mistralai.extra import response_format_from_pydantic_model
from google.colab import userdata

api_key = userdata.get('MISTRAL_API_KEY') # Replace with your way to retrieve API key

if not api_key:
    print("⚠️  WARNING: No API key found!")
    print("   Set MISTRAL_API_KEY environment variable, or")
    print("   Uncomment and set api_key directly above")
else:
    client = Mistral(api_key=api_key)
    print("✅ Mistral client initialized")

def encode_pdf(pdf_path: str) -> Optional[str]:
    """Encode a PDF file to base64 string."""
    try:
        with open(pdf_path, "rb") as pdf_file:
            return base64.b64encode(pdf_file.read()).decode('utf-8')
    except FileNotFoundError:
        print(f"Error: The file {pdf_path} was not found.")
        return None
    except Exception as e:
        print(f"Error: {e}")
        return None

print("✅ Production patterns setup complete")

3.1 Document Classification and Intelligent Routing

Healthcare organizations receive mixed document types daily—faxes, scanned forms, digital PDFs. Before extraction, classify incoming documents to determine:

  • Document type (demographics, vitals, lab results, progress notes)
  • Routing destination (billing, clinical, pharmacy)
  • Processing urgency (stat vs routine)

This information will then inform which extraction schema to apply in the next section.

# Define classification schema for healthcare documents

# Possible values for incoming documents
class HealthcareDocumentType(str, Enum):
    PATIENT_DEMOGRAPHICS = "patient_demographics"
    PROGRESS_NOTES = "progress_notes"
    VITALS_FLOWSHEET = "vitals_flowsheet"
    LAB_RESULTS = "lab_results"
    MEDICATION_LIST = "medication_list"
    PRIOR_AUTHORIZATION = "prior_authorization"
    INSURANCE_CARD = "insurance_card"
    CONSENT_FORM = "consent_form"
    UNKNOWN = "unknown"

# Possible values for departments to route to
class RoutingDepartment(str, Enum):
    CLINICAL = "clinical"
    BILLING = "billing"
    PHARMACY = "pharmacy"
    RECORDS = "medical_records"
    INTAKE = "patient_intake"

# Classification schema
class DocumentClassification(BaseModel):
    document_type: HealthcareDocumentType = Field(..., description="The primary type of healthcare document")
    confidence: float = Field(..., description="Confidence score between 0.0 and 1.0")
    routing_department: RoutingDepartment = Field(..., description="Department that should handle this document")
    urgency: str = Field(..., description="Processing priority: 'stat', 'urgent', or 'routine'")
    key_identifiers_found: List[str] = Field(default_factory=list, description="Patient identifiers detected (e.g., 'MRN', 'DOB', 'Name')")
    requires_signature: bool = Field(default=False, description="Whether document requires/contains signatures")
    summary: str = Field(..., description="One-sentence summary of document contents")

# Classify the patient packet document

base64_packet = encode_pdf(pdf_path)

# First-pass classification over the packet (capped at the 8-page annotation limit below)
classification_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{base64_packet}"
    },
    pages=list(range(8)),  # Document annotations are limited to 8 pages; we recommend splitting longer documents (bbox annotations have no such limit)
    document_annotation_format=response_format_from_pydantic_model(DocumentClassification),
    include_image_base64=False
)

# Parse and display classification
classification = json.loads(classification_response.document_annotation)
print("📋 Document Classification Results")
print("=" * 50)
print(f"Type:        {classification['document_type']}")
print(f"Confidence:  {classification['confidence']:.0%}")
print(f"Route to:    {classification['routing_department']}")
print(f"Urgency:     {classification['urgency']}")
print(f"Identifiers: {', '.join(classification['key_identifiers_found'])}")
print(f"Signature:   {'Yes' if classification['requires_signature'] else 'No'}")
print(f"\nSummary: {classification['summary']}")

3.2 Classification-Based Schema Extraction

Based on the classification from the previous step, we'll extract different data elements from each classified document.

# Define type-specific extraction schemas for each document type

class PatientDemographics(BaseModel):
    """Schema for patient demographics and intake forms."""
    patient_name: str = Field(..., description="Full patient name")
    date_of_birth: str = Field(..., description="DOB in MM/DD/YYYY format")
    gender: Optional[str] = Field(None, description="Patient gender")
    address: Optional[str] = Field(None, description="Patient address")
    phone: Optional[str] = Field(None, description="Contact phone number")
    insurance_id: Optional[str] = Field(None, description="Insurance member ID")
    emergency_contact: Optional[str] = Field(None, description="Emergency contact info")

class VitalsFlowsheet(BaseModel):
    """Schema for vital signs flowsheets."""
    date_recorded: str = Field(..., description="Date vitals were recorded")
    blood_pressure: Optional[str] = Field(None, description="Blood pressure reading (systolic/diastolic)")
    heart_rate: Optional[str] = Field(None, description="Heart rate in BPM")
    temperature: Optional[str] = Field(None, description="Body temperature")
    respiratory_rate: Optional[str] = Field(None, description="Respiratory rate")
    oxygen_saturation: Optional[str] = Field(None, description="SpO2 percentage")
    weight: Optional[str] = Field(None, description="Patient weight")
    height: Optional[str] = Field(None, description="Patient height")

# Map document types to their extraction schemas
EXTRACTION_SCHEMAS: Dict[str, type] = {
    "patient_demographics": PatientDemographics,
    "vitals_flowsheet": VitalsFlowsheet,
    # Add more mappings as needed
}

print("✅ Defined extraction schemas for:", list(EXTRACTION_SCHEMAS.keys()))

3.3 Combine Classification and Extraction with Batch Processing

Production systems process hundreds of documents daily. This pattern extends classification and extraction to multi-page packets where each page may be a different document type (demographics, vitals, lab results, etc.).

@dataclass
class PageResult:
    """Result container for a single processed page"""
    page_index: int
    document_type: str
    classification: Dict[str, Any]
    extracted_data: Optional[Dict[str, Any]]
    markdown_content: str
    status: str  # "success", "error", "skipped"
    error_message: Optional[str] = None
    processing_time_ms: float = 0

def process_patient_packet(pdf_path: str, rate_limit_delay: float = 0.5) -> List[PageResult]:
    """
    Process a multi-page patient packet with per-page classification and extraction.

    Args:
        pdf_path: Path to the PDF file
        rate_limit_delay: Delay between API calls to respect rate limits

    Returns:
        List of PageResult objects with classification and extracted data
    """
    base64_pdf = encode_pdf(pdf_path)
    results = []

    # First, get the full document to know page count
    full_response = client.ocr.process(
        model="mistral-ocr-latest",
        document={
            "type": "document_url",
            "document_url": f"data:application/pdf;base64,{base64_pdf}"
        },
        include_image_base64=False
    )

    num_pages = len(full_response.pages)
    print(f"📄 Processing {num_pages} pages from {pdf_path}")
    print("-" * 50)

    for page_idx in range(num_pages):
        start_time = time.time()

        try:
            # Step 1: Classify this page
            time.sleep(rate_limit_delay)  # Rate limiting

            classify_response = client.ocr.process(
                model="mistral-ocr-latest",
                document={
                    "type": "document_url",
                    "document_url": f"data:application/pdf;base64,{base64_pdf}"
                },
                pages=[page_idx],
                document_annotation_format=response_format_from_pydantic_model(DocumentClassification),
                include_image_base64=False
            )

            page_classification = json.loads(classify_response.document_annotation)
            doc_type = page_classification["document_type"]

            # Step 2: Extract with type-specific schema if available
            extracted_data = None
            if doc_type in EXTRACTION_SCHEMAS:
                time.sleep(rate_limit_delay)

                extract_response = client.ocr.process(
                    model="mistral-ocr-latest",
                    document={
                        "type": "document_url",
                        "document_url": f"data:application/pdf;base64,{base64_pdf}"
                    },
                    pages=[page_idx],
                    document_annotation_format=response_format_from_pydantic_model(EXTRACTION_SCHEMAS[doc_type]),
                    include_image_base64=False
                )
                extracted_data = json.loads(extract_response.document_annotation)

            processing_time = (time.time() - start_time) * 1000

            result = PageResult(
                page_index=page_idx,
                document_type=doc_type,
                classification=page_classification,
                extracted_data=extracted_data,
                markdown_content=full_response.pages[page_idx].markdown,
                status="success",
                processing_time_ms=processing_time
            )

            print(f"  ✅ Page {page_idx + 1}: {doc_type} ({processing_time:.0f}ms)")

        except Exception as e:
            result = PageResult(
                page_index=page_idx,
                document_type="unknown",
                classification={},
                extracted_data=None,
                markdown_content="",
                status="error",
                error_message=str(e),
                processing_time_ms=(time.time() - start_time) * 1000
            )
            print(f"  ❌ Page {page_idx + 1}: Error - {str(e)[:50]}")

        results.append(result)

    return results

# Process the patient packet
batch_results = process_patient_packet("patient-packet-completed.pdf")

# Summary statistics
print("\n" + "=" * 50)
print("📊 BATCH PROCESSING SUMMARY")
print("=" * 50)
successful = [r for r in batch_results if r.status == "success"]
failed = [r for r in batch_results if r.status == "error"]

print(f"Total pages:     {len(batch_results)}")
print(f"Successful:      {len(successful)}")
print(f"Failed:          {len(failed)}")
print(f"Total time:      {sum(r.processing_time_ms for r in batch_results):.0f}ms")

# Group by document type
doc_types = Counter(r.document_type for r in successful)
print(f"\nDocument types found:")
for doc_type, count in doc_types.items():
    print(f"  • {doc_type}: {count} page(s)")

# Display extracted data for each page
print("📋 EXTRACTED DATA BY PAGE")
print("=" * 50)

for result in batch_results:
    if result.status == "success":
        print(f"\n🔹 Page {result.page_index + 1}: {result.document_type}")
        print(f"   Confidence: {result.classification.get('confidence', 'N/A'):.0%}")
        print(f"   Route to: {result.classification.get('routing_department', 'N/A')}")

        if result.extracted_data:
            print("   Extracted fields:")
            for key, value in result.extracted_data.items():
                if value:  # Only show non-null values
                    print(f"     • {key}: {value}")

3.4 FHIR Resource Generation

Extracted data is only valuable if it integrates with clinical systems. FHIR (Fast Healthcare Interoperability Resources) is the industry standard for healthcare data exchange, supported by Epic, Cerner, and all major EHRs.

This pattern transforms our batch-extracted data into FHIR R4 resources:

  • Patient → Demographics from intake forms
  • Observation → Vital signs measurements

The resulting FHIR Bundle can be POSTed to any FHIR-compliant system.

# FHIR R4 Resource Generation from extracted OCR data

def generate_fhir_patient(demographics: Dict[str, Any]) -> Dict[str, Any]:
    """Convert extracted demographics to FHIR R4 Patient resource"""
    # Parse name (assumes "Last, First" or "First Last" format)
    name_parts = demographics.get("patient_name", "Unknown").replace(",", " ").split()

    patient = {
        "resourceType": "Patient",
        "id": str(uuid.uuid4()),
        "meta": {
            "profile": ["http://hl7.org/fhir/us/core/StructureDefinition/us-core-patient"]
        },
        "identifier": [{
            "system": "urn:oid:2.16.840.1.113883.4.1",  # Example OID
            "value": demographics.get("insurance_id", "UNKNOWN")
        }],
        "name": [{
            "use": "official",
            "family": name_parts[0] if name_parts else "Unknown",
            "given": name_parts[1:] if len(name_parts) > 1 else []
        }],
        "birthDate": convert_date_to_fhir(demographics.get("date_of_birth")),
        "gender": map_gender(demographics.get("gender")),
    }

    # Add address if present
    if demographics.get("address"):
        patient["address"] = [{
            "use": "home",
            "text": demographics["address"]
        }]

    # Add phone if present
    if demographics.get("phone"):
        patient["telecom"] = [{
            "system": "phone",
            "value": demographics["phone"],
            "use": "home"
        }]

    return patient

def generate_fhir_vitals(vitals: Dict[str, Any], patient_id: str) -> List[Dict[str, Any]]:
    """Convert extracted vitals to FHIR R4 Observation resources"""
    observations = []

    # LOINC codes for common vitals
    vital_mappings = {
        "blood_pressure": {"code": "85354-9", "display": "Blood pressure panel"},
        "heart_rate": {"code": "8867-4", "display": "Heart rate", "unit": "/min"},
        "temperature": {"code": "8310-5", "display": "Body temperature", "unit": "Cel"},
        "respiratory_rate": {"code": "9279-1", "display": "Respiratory rate", "unit": "/min"},
        "oxygen_saturation": {"code": "2708-6", "display": "Oxygen saturation", "unit": "%"},
        "weight": {"code": "29463-7", "display": "Body weight", "unit": "kg"},
        "height": {"code": "8302-2", "display": "Body height", "unit": "cm"}
    }

    effective_date = convert_date_to_fhir(vitals.get("date_recorded")) or datetime.now().strftime("%Y-%m-%d")

    for vital_key, loinc in vital_mappings.items():
        value = vitals.get(vital_key)
        if value:
            observation = {
                "resourceType": "Observation",
                "id": str(uuid.uuid4()),
                "status": "final",
                "category": [{
                    "coding": [{
                        "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                        "code": "vital-signs",
                        "display": "Vital Signs"
                    }]
                }],
                "code": {
                    "coding": [{
                        "system": "http://loinc.org",
                        "code": loinc["code"],
                        "display": loinc["display"]
                    }]
                },
                "subject": {"reference": f"Patient/{patient_id}"},
                "effectiveDateTime": effective_date,
                "valueString": str(value)  # Using string for flexibility; production would parse numeric
            }
            observations.append(observation)

    return observations

# Helper functions
def convert_date_to_fhir(date_str: Optional[str]) -> Optional[str]:
    """Convert various date formats to FHIR format (YYYY-MM-DD)"""
    if not date_str:
        return None
    # Handle MM/DD/YYYY format
    try:
        parts = date_str.replace("-", "/").split("/")
        if len(parts) == 3:
            if len(parts[0]) == 4:  # Already YYYY-MM-DD
                return date_str
            return f"{parts[2]}-{parts[0].zfill(2)}-{parts[1].zfill(2)}"
    except Exception:
        pass
    return date_str

def map_gender(gender_str: Optional[str]) -> str:
    """Map various gender representations to FHIR values"""
    if not gender_str:
        return "unknown"
    g = gender_str.lower().strip()
    if g in ["m", "male"]:
        return "male"
    elif g in ["f", "female"]:
        return "female"
    return "unknown"

print("✅ FHIR resource generators ready")
def create_fhir_bundle_from_batch(batch_results: List[PageResult]) -> Dict[str, Any]:
    """
    Create a FHIR Bundle from batch-processed OCR results.

    Args:
        batch_results: Results from process_patient_packet()

    Returns:
        FHIR R4 Bundle resource ready for EHR integration
    """
    bundle = {
        "resourceType": "Bundle",
        "id": str(uuid.uuid4()),
        "type": "transaction",
        "timestamp": datetime.now().isoformat(),
        "entry": []
    }

    patient_id = None

    for result in batch_results:
        if result.status != "success" or not result.extracted_data:
            continue

        doc_type = result.document_type
        data = result.extracted_data

        # Generate Patient resource from demographics
        if doc_type == "patient_demographics":
            patient_resource = generate_fhir_patient(data)
            patient_id = patient_resource["id"]
            bundle["entry"].append({
                "fullUrl": f"urn:uuid:{patient_id}",
                "resource": patient_resource,
                "request": {
                    "method": "POST",
                    "url": "Patient"
                }
            })

        # Generate Observations from vitals
        elif doc_type == "vitals_flowsheet" and patient_id:
            observations = generate_fhir_vitals(data, patient_id)
            for obs in observations:
                bundle["entry"].append({
                    "fullUrl": f"urn:uuid:{obs['id']}",
                    "resource": obs,
                    "request": {
                        "method": "POST",
                        "url": "Observation"
                    }
                })

    return bundle

# Generate FHIR Bundle from our batch results
fhir_bundle = create_fhir_bundle_from_batch(batch_results)

print("🏥 FHIR BUNDLE GENERATED")
print("=" * 50)
print(f"Bundle ID: {fhir_bundle['id']}")
print(f"Bundle Type: {fhir_bundle['type']}")
print(f"Total Resources: {len(fhir_bundle['entry'])}")
print(f"\nResources by type:")
resource_types = Counter(e["resource"]["resourceType"] for e in fhir_bundle["entry"])
for rtype, count in resource_types.items():
    print(f"  • {rtype}: {count}")


# Display the full FHIR Bundle (ready to POST to an EHR)
print("📄 FHIR BUNDLE JSON (Ready for EHR Integration)")
print("=" * 50)
print(json.dumps(fhir_bundle, indent=2))
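
As a final sketch, a transaction Bundle is submitted by POSTing it to the server base URL. The endpoint below is hypothetical; point it at your EHR's FHIR base and add whatever authentication your system requires:

import requests

FHIR_BASE_URL = "https://fhir.example.org/r4"  # hypothetical endpoint

response = requests.post(
    FHIR_BASE_URL,  # transaction bundles go to the server base, not a resource path
    json=fhir_bundle,
    headers={"Content-Type": "application/fhir+json"},
    timeout=30,
)
print(f"Server responded: {response.status_code}")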

4. Summary & Next Steps

This cookbook demonstrated a complete production pipeline for healthcare OCR:

| Pattern | Purpose | Key Benefit |
|---------|---------|-------------|
| Classification | Identify document types | Route to correct extraction schema |
| Batch Processing | Handle multi-page packets | Scale with rate limiting & error isolation |
| FHIR Generation | Convert to healthcare standards | Integrate with Epic, Cerner, any FHIR EHR |

Production enhancements to consider:

  • Confidence thresholds → Flag low-confidence extractions for human review (see the sketch below)
  • Async processing → Use asyncio for higher throughput
  • Audit logging → Track PHI access for HIPAA compliance
  • FHIR validation → Validate bundles against US Core profiles before submission
  • Webhook notifications → Alert downstream systems when processing completes
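
A minimal sketch of the first enhancement, applied to the batch results from section 3.3; the 0.85 cutoff is an assumed starting point, not a recommendation:

CONFIDENCE_CUTOFF = 0.85  # assumed threshold; calibrate against your data

needs_review = [
    r for r in batch_results
    if r.status == "success"
    and r.classification.get("confidence", 0.0) < CONFIDENCE_CUTOFF
]

for r in needs_review:
    confidence = r.classification.get("confidence", 0.0)
    print(f"Page {r.page_index + 1} flagged for human review (confidence: {confidence:.0%})")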