Healthcare generates 30% of the world's data, yet much of it remains locked in PDFs, scanned faxes, and unstructured documents. As regulations like the CMS prior authorization mandate push toward digital-first operations and hospital staffing shortages intensify, automated document processing has become critical infrastructure, not just for patient intake, but for back office operations like invoice management, medical billing and coding, and clinical documentation at scale.
Key challenges driving Document AI adoption:
- 30% of global data originates in healthcare, mostly unstructured
- Legacy systems rely on paper, fax, and non-digital formats
- Regulatory pressure (CMS mandates, interoperability requirements)
- Severe staffing shortages across clinical and administrative roles
Mistral OCR 3 handles intricate healthcare documents—handwritten notes, nested lab tables, checkboxes, and multi-page forms—with accuracy comparable to commercial solutions at a fraction of the cost. This cookbook demonstrates how to get started.
You can also explore Document AI interactively in AI Studio.
1. Setup
First, let's install mistralai and download the document.
%%capture
!pip install mistralai
Sample Document
This cookbook uses patient-packet-completed.pdf - a synthetic multi-page patient packet containing demographics, vitals, and clinical notes.
%%capture
!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/hcls/patient-packet-completed.pdf
# Verify sample document exists
import os
# Path to your pdf (using local file)
pdf_path = "patient-packet-completed.pdf"
if os.path.exists(pdf_path):
print(f"✅ Found: {pdf_path}")
print(f" Size: {os.path.getsize(pdf_path) / 1024:.1f} KB")
else:
print(f"❌ File not found: {pdf_path}")
print(" Please ensure patient-packet-completed.pdf is in the working directory")
# List available sample files in the workspace
!ls -la *.pdf 2>/dev/null || echo "No PDF files found in current directory"
Create Client
We will need to set up our client. You can create an API key on AI Studio.
# Initialize Mistral client with API key
import os
from mistralai import Mistral
from google.colab import userdata
import requests
api_key = userdata.get('MISTRAL_API_KEY') # Replace with your way to retrieve API key
if not api_key:
print("⚠️ WARNING: No API key found!")
else:
client = Mistral(api_key=api_key)
print("✅ Mistral client initialized")2. Use Case: Patient Medical Record Packet OCR Processing
This section showcases Mistral OCR 3 capabilities using a 3-page patient packet. We will use each page to highlight various features:
| Page | Document Type | OCR Feature |
|---|---|---|
| 1 | Patient Admission Form | Form elements - checkboxes, handwriting, unified unicode representation |
| 2 | Vital Signs Flowsheet | HTML table output - complex tables with rowspan/colspan |
| 3 | Foot X-ray | Image annotations - embedded images with descriptions |
> Note: Sample data is synthetic/anonymized.
2.1 Setup: Load and Process Document
First, let's encode the PDF and run OCR on the full packet. We'll then explore each page's output.
import base64
def encode_pdf(pdf_path):
"""Encode the pdf to base64."""
try:
with open(pdf_path, "rb") as pdf_file:
return base64.b64encode(pdf_file.read()).decode('utf-8')
except FileNotFoundError:
print(f"Error: The file {pdf_path} was not found.")
return None
except Exception as e:
print(f"Error: {e}")
return None
Process the full document and get the OCR output:
import json
# Getting the base64 string
base64_pdf = encode_pdf(pdf_path)
# Call the OCR API
pdf_response = client.ocr.process(
model="mistral-ocr-latest",
document={
"type": "document_url",
"document_url": f"data:application/pdf;base64,{base64_pdf}"
},
include_image_base64=True,
table_format="html" #Specify HTML format to render complex table formats
)
# Convert response to JSON format
response_dict = json.loads(pdf_response.model_dump_json())
print(json.dumps(response_dict, indent=4)[0:1000]) # check the first 1000 characters
Let's style the output for easier reading.
from IPython.display import display, HTML, Markdown
from mistralai.models import OCRResponse
from bs4 import BeautifulSoup
# CSS styling for tables (reusable constant)
TABLE_STYLE = """
<style>
table {
border-collapse: collapse;
width: 100%;
margin: 10px 0;
}
th, td {
border: 1px solid black;
padding: 8px;
text-align: left;
}
th {
background-color: #f2f2f2;
}
</style>
"""
def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
"""Replace image placeholders in markdown with base64-encoded images."""
for img_name, base64_str in images_dict.items():
markdown_str = markdown_str.replace(
f"", f"<img src='{base64_str}' style='max-width:100%;'/>"
)
return markdown_str
def display_page_with_tables(page_index: int, ocr_data: dict, pdf_response: OCRResponse):
"""
Display a page with styled HTML tables and embedded images.
Tables are inserted inline at their original positions.
Uses REST API response for tables, SDK response for images.
Args:
page_index: Index of the page to display (0-based)
ocr_data: JSON data from REST API response
pdf_response: OCRResponse object from SDK
"""
if page_index >= len(ocr_data["pages"]):
print(f"Page {page_index} not found")
return
page = ocr_data["pages"][page_index]
markdown = page["markdown"]
# Replace table placeholders with styled HTML tables (preserves order)
# This specifically handles the format [tbl-X.html](tbl-X.html) where X is the table index
if "tables" in page and page["tables"]:
for table in page["tables"]:
table_id = table.get("id", "")
if table_id:
# Replace the exact placeholder format from the OCR output
placeholder = f"[{table_id}]({table_id})"
styled_table = TABLE_STYLE + table["content"]
markdown = markdown.replace(placeholder, styled_table)
# Replace image placeholders with base64 from pdf_response
if page_index < len(pdf_response.pages):
for img in pdf_response.pages[page_index].images:
markdown = markdown.replace(
f"",
f"<img src='{img.image_base64}' style='max-width:100%;'/>"
)
# Display as HTML with whitespace preservation
display(HTML(f"<div style='white-space: pre-wrap;'>{markdown}</div>"))
def display_all_pages(ocr_data: dict, pdf_response: OCRResponse, pages=None):
"""
Display all pages with styled HTML tables and images.
Args:
ocr_data: JSON data from REST API response
pdf_response: OCRResponse object from SDK
pages: List of page indices to display (None for all pages)
"""
if pages is None:
pages = range(len(ocr_data["pages"]))
for i in pages:
# Print page separator with proper newline handling
print(f"\n{'='*60}")
print(f"📄 PAGE {ocr_data['pages'][i]['index'] + 1}")
print('='*60)
# Display the page content with proper whitespace preservation
display_page_with_tables(i, ocr_data, pdf_response)
# Add spacing between pages for better readability
print(f"\n{'\n'}")2.2 Form Elements: Checkboxes & Structured Fields (Page 1)
Page 1 contains a Patient Admission Form with checkboxes, handwriting, and fill-in lines. Mistral OCR 3 uses unified Unicode checkbox representation (☐ unchecked, ☑ checked) for consistent parsing.
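Because the glyphs are unified, checkbox states can be recovered from the raw markdown with a simple pattern match. A minimal sketch (the regex is illustrative, not part of the API):
# List each checkbox on page 1 together with its label text.
import re

page1_markdown = response_dict["pages"][0]["markdown"]
checkbox_pattern = re.compile(r"([☐☑])\s*([^☐☑\n|]+)")  # glyph followed by its label
for glyph, label in checkbox_pattern.findall(page1_markdown):
    status = "checked" if glyph == "☑" else "unchecked"
    print(f"{status:>9}: {label.strip()}")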
# Display Page 1 - Form Elements
print("📄 PAGE 1: Patient Admission Form")
print("Notice: Checkboxes rendered as ☐ (unchecked) and ☑ (checked)\n")
display_page_with_tables(0, response_dict, pdf_response)
2.3 HTML Table Output: Vital Signs Flowsheet (Page 2)
Page 2 contains a Vital Signs Flowsheet with complex table structures. Mistral OCR 3 can output tables as HTML with proper rowspan and colspan attributes, preserving the original structure for accurate data extraction.
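Because the tables come back as standard HTML, they drop straight into ordinary tooling. A minimal sketch (assuming pandas is available, as in Colab) that loads page 2's tables into DataFrames; read_html expands rowspan/colspan cells automatically:
# Parse the OCR'd HTML tables into DataFrames for downstream analysis.
import io
import pandas as pd

page2 = response_dict["pages"][1]
for table in page2.get("tables", []):
    df = pd.read_html(io.StringIO(table["content"]))[0]
    print(f"Table {table.get('id', '?')}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.head())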
# Display Page 2 - Vital Signs Flowsheet with HTML table
print("📄 PAGE 2: Vital Signs Flowsheet")
print("Notice: Tables output as HTML with rowspan/colspan preserved\n")
display_page_with_tables(1, response_dict, pdf_response)
2.4 Image Annotations: X-ray (Page 3)
Page 3 contains an X-ray image. Mistral OCR 3 can detect, extract, and annotate images within documents. The image is embedded in the markdown output with base64 encoding.
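The embedded image can also be written back out to a file. A minimal sketch, assuming image_base64 is returned as a data URI (the rendering code above uses it directly as an <img src>):
# Decode the first embedded image on page 3 and save it to disk.
import base64

images = pdf_response.pages[2].images
if images:
    img = images[0]
    _, b64_data = img.image_base64.split(",", 1)  # strip the "data:image/...;base64," header
    with open(img.id, "wb") as f:  # img.id is typically a filename such as "img-0.jpeg"
        f.write(base64.b64decode(b64_data))
    print(f"Saved {img.id}")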
# Display Page 3 - X-ray Image
print("📄 PAGE 3: Foot X-ray")
print("Notice: Images are detected and embedded with base64 encoding\n")
page3 = pdf_response.pages[2]
# Show image metadata
print(f"Images detected on this page: {len(page3.images)}")
for img in page3.images:
print(f" - Image ID: {img.id}")
print(f" Dimensions: ({img.top_left_x}, {img.top_left_y}) to ({img.bottom_right_x}, {img.bottom_right_y})")
print("\n" + "="*60)
print("Rendered Output (with embedded X-ray image):")
print("="*60)
# Display page with images and tables
display_page_with_tables(2, response_dict, pdf_response)
3. Document Intelligence Using Annotations
Moving from notebooks to production requires patterns for scale, reliability, and interoperability. This section demonstrates a progressive pipeline where we perform the following using DocAI Annotations:
- Classification → Identify document types before extraction
- Batch Processing → Handle multiple documents concurrently
- FHIR Generation → Transform extracted data into healthcare-standard formats
Note: Refer to this cookbook for an intro to Annotations.
Production Patterns Setup - imports and utilities
import base64
import json
import time
import uuid
import os
from datetime import datetime
from enum import Enum
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from collections import Counter
from pydantic import BaseModel, Field, JsonValue
from mistralai import Mistral
from mistralai.extra import response_format_from_pydantic_model
from google.colab import userdata
api_key = userdata.get('MISTRAL_API_KEY') # Replace with your way to retrieve API key
if not api_key:
print("⚠️ WARNING: No API key found!")
print(" Set MISTRAL_API_KEY environment variable, or")
print(" Uncomment and set api_key directly above")
else:
client = Mistral(api_key=api_key)
print("✅ Mistral client initialized")
def encode_pdf(pdf_path: str) -> Optional[str]:
"""Encode a PDF file to base64 string."""
try:
with open(pdf_path, "rb") as pdf_file:
return base64.b64encode(pdf_file.read()).decode('utf-8')
except FileNotFoundError:
print(f"Error: The file {pdf_path} was not found.")
return None
except Exception as e:
print(f"Error: {e}")
return None
print("✅ Production patterns setup complete")3.1 Document Classification and Intelligent Routing
Healthcare organizations receive mixed document types daily—faxes, scanned forms, digital PDFs. Before extraction, classify incoming documents to determine:
- Document type (demographics, vitals, lab results, progress notes)
- Routing destination (billing, clinical, pharmacy)
- Processing urgency (stat vs routine)
This information will then inform which extraction schema to apply in the next section.
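To make the routing concrete, here is a minimal dispatch sketch; the queue names are hypothetical placeholders for whatever downstream systems you use, and the input is the classification dict produced later in this section:
# A hedged sketch of downstream routing; queue names are hypothetical placeholders.
DEPARTMENT_QUEUES = {
    "clinical": "queue/clinical-review",
    "billing": "queue/billing-intake",
    "pharmacy": "queue/pharmacy",
    "medical_records": "queue/records",
    "patient_intake": "queue/intake",
}

def route_document(classification: dict) -> str:
    """Map a classification result to a downstream work queue."""
    queue = DEPARTMENT_QUEUES.get(classification["routing_department"], "queue/manual-review")
    if classification.get("urgency") == "stat":
        queue = f"{queue}?priority=high"  # stat documents jump the line
    return queue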
# Define classification schema for healthcare documents
# Possible values for incoming documents
class HealthcareDocumentType(str, Enum):
PATIENT_DEMOGRAPHICS = "patient_demographics"
PROGRESS_NOTES = "progress_notes"
VITALS_FLOWSHEET = "vitals_flowsheet"
LAB_RESULTS = "lab_results"
MEDICATION_LIST = "medication_list"
PRIOR_AUTHORIZATION = "prior_authorization"
INSURANCE_CARD = "insurance_card"
CONSENT_FORM = "consent_form"
UNKNOWN = "unknown"
# Possible values for departments to route to
class RoutingDepartment(str, Enum):
CLINICAL = "clinical"
BILLING = "billing"
PHARMACY = "pharmacy"
RECORDS = "medical_records"
INTAKE = "patient_intake"
# Classification schema
class DocumentClassification(BaseModel):
document_type: HealthcareDocumentType = Field(..., description="The primary type of healthcare document")
confidence: float = Field(..., description="Confidence score between 0.0 and 1.0")
routing_department: RoutingDepartment = Field(..., description="Department that should handle this document")
urgency: str = Field(..., description="Processing priority: 'stat', 'urgent', or 'routine'")
key_identifiers_found: List[str] = Field(default=[], description="Patient identifiers detected (e.g., 'MRN', 'DOB', 'Name')")
requires_signature: bool = Field(default=False, description="Whether document requires/contains signatures")
summary: str = Field(..., description="One-sentence summary of document contents")
# Classify the patient packet document
base64_packet = encode_pdf(pdf_path)
# First-pass classification using the first pages (up to the 8-page annotation limit)
classification_response = client.ocr.process(
model="mistral-ocr-latest",
document={
"type": "document_url",
"document_url": f"data:application/pdf;base64,{base64_packet}"
},
pages=list(range(8)), # Document annotations are limited to 8 pages; we recommend splitting longer documents. Bbox annotations do not share this limit.
document_annotation_format=response_format_from_pydantic_model(DocumentClassification),
include_image_base64=False
)
# Parse and display classification
classification = json.loads(classification_response.document_annotation)
print("📋 Document Classification Results")
print("=" * 50)
print(f"Type: {classification['document_type']}")
print(f"Confidence: {classification['confidence']:.0%}")
print(f"Route to: {classification['routing_department']}")
print(f"Urgency: {classification['urgency']}")
print(f"Identifiers: {', '.join(classification['key_identifiers_found'])}")
print(f"Signature: {'Yes' if classification['requires_signature'] else 'No'}")
print(f"\nSummary: {classification['summary']}")3.2 Classification-Based Schema Extraction
Based on the classification from the previous step, we'll extract different data elements from each document type.
# Define type-specific extraction schemas for each document type
class PatientDemographics(BaseModel):
"""Schema for patient demographics and intake forms."""
patient_name: str = Field(..., description="Full patient name")
date_of_birth: str = Field(..., description="DOB in MM/DD/YYYY format")
gender: Optional[str] = Field(None, description="Patient gender")
address: Optional[str] = Field(None, description="Patient address")
phone: Optional[str] = Field(None, description="Contact phone number")
insurance_id: Optional[str] = Field(None, description="Insurance member ID")
emergency_contact: Optional[str] = Field(None, description="Emergency contact info")
class VitalsFlowsheet(BaseModel):
"""Schema for vital signs flowsheets."""
date_recorded: str = Field(..., description="Date vitals were recorded")
blood_pressure: Optional[str] = Field(None, description="Blood pressure reading (systolic/diastolic)")
heart_rate: Optional[str] = Field(None, description="Heart rate in BPM")
temperature: Optional[str] = Field(None, description="Body temperature")
respiratory_rate: Optional[str] = Field(None, description="Respiratory rate")
oxygen_saturation: Optional[str] = Field(None, description="SpO2 percentage")
weight: Optional[str] = Field(None, description="Patient weight")
height: Optional[str] = Field(None, description="Patient height")
# Map document types to their extraction schemas
EXTRACTION_SCHEMAS: Dict[str, type] = {
"patient_demographics": PatientDemographics,
"vitals_flowsheet": VitalsFlowsheet,
# Add more mappings as needed
}
print("✅ Defined extraction schemas for:", list(EXTRACTION_SCHEMAS.keys()))3.3 Combine Classification and Extraction with Batch Processing
Production systems process hundreds of documents daily. This pattern extends classification & extraction to handle multi-page packets where each page may be a different document type (demographics, vitals, lab results, etc.).
@dataclass
class PageResult:
"""Result container for a single processed page"""
page_index: int
document_type: str
classification: Dict[str, Any]
extracted_data: Optional[Dict[str, Any]]
markdown_content: str
status: str # "success", "error", "skipped"
error_message: Optional[str] = None
processing_time_ms: float = 0
def process_patient_packet(pdf_path: str, rate_limit_delay: float = 0.5) -> List[PageResult]:
"""
Process a multi-page patient packet with per-page classification and extraction.
Args:
pdf_path: Path to the PDF file
rate_limit_delay: Delay between API calls to respect rate limits
Returns:
List of PageResult objects with classification and extracted data
"""
base64_pdf = encode_pdf(pdf_path)
results = []
# First, get the full document to know page count
full_response = client.ocr.process(
model="mistral-ocr-latest",
document={
"type": "document_url",
"document_url": f"data:application/pdf;base64,{base64_pdf}"
},
include_image_base64=False
)
num_pages = len(full_response.pages)
print(f"📄 Processing {num_pages} pages from {pdf_path}")
print("-" * 50)
for page_idx in range(num_pages):
start_time = time.time()
try:
# Step 1: Classify this page
time.sleep(rate_limit_delay) # Rate limiting
classify_response = client.ocr.process(
model="mistral-ocr-latest",
document={
"type": "document_url",
"document_url": f"data:application/pdf;base64,{base64_pdf}"
},
pages=[page_idx],
document_annotation_format=response_format_from_pydantic_model(DocumentClassification),
include_image_base64=False
)
page_classification = json.loads(classify_response.document_annotation)
doc_type = page_classification["document_type"]
# Step 2: Extract with type-specific schema if available
extracted_data = None
if doc_type in EXTRACTION_SCHEMAS:
time.sleep(rate_limit_delay)
extract_response = client.ocr.process(
model="mistral-ocr-latest",
document={
"type": "document_url",
"document_url": f"data:application/pdf;base64,{base64_pdf}"
},
pages=[page_idx],
document_annotation_format=response_format_from_pydantic_model(EXTRACTION_SCHEMAS[doc_type]),
include_image_base64=False
)
extracted_data = json.loads(extract_response.document_annotation)
processing_time = (time.time() - start_time) * 1000
result = PageResult(
page_index=page_idx,
document_type=doc_type,
classification=page_classification,
extracted_data=extracted_data,
markdown_content=full_response.pages[page_idx].markdown,
status="success",
processing_time_ms=processing_time
)
print(f" ✅ Page {page_idx + 1}: {doc_type} ({processing_time:.0f}ms)")
except Exception as e:
result = PageResult(
page_index=page_idx,
document_type="unknown",
classification={},
extracted_data=None,
markdown_content="",
status="error",
error_message=str(e),
processing_time_ms=(time.time() - start_time) * 1000
)
print(f" ❌ Page {page_idx + 1}: Error - {str(e)[:50]}")
results.append(result)
return results
# Process the patient packet
batch_results = process_patient_packet("patient-packet-completed.pdf")
# Summary statistics
print("\n" + "=" * 50)
print("📊 BATCH PROCESSING SUMMARY")
print("=" * 50)
successful = [r for r in batch_results if r.status == "success"]
failed = [r for r in batch_results if r.status == "error"]
print(f"Total pages: {len(batch_results)}")
print(f"Successful: {len(successful)}")
print(f"Failed: {len(failed)}")
print(f"Total time: {sum(r.processing_time_ms for r in batch_results):.0f}ms")
# Group by document type
doc_types = Counter(r.document_type for r in successful)
print(f"\nDocument types found:")
for doc_type, count in doc_types.items():
print(f" • {doc_type}: {count} page(s)")
# Display extracted data for each page
print("📋 EXTRACTED DATA BY PAGE")
print("=" * 50)
for result in batch_results:
if result.status == "success":
print(f"\n🔹 Page {result.page_index + 1}: {result.document_type}")
print(f" Confidence: {result.classification.get('confidence', 'N/A'):.0%}")
print(f" Route to: {result.classification.get('routing_department', 'N/A')}")
if result.extracted_data:
print(" Extracted fields:")
for key, value in result.extracted_data.items():
if value: # Only show non-null values
print(f" • {key}: {value}")3.4 FHIR Resource Generation
Extracted data is only valuable if it integrates with clinical systems. FHIR (Fast Healthcare Interoperability Resources) is the industry standard for healthcare data exchange, supported by Epic, Cerner, and all major EHRs.
This pattern transforms our batch-extracted data into FHIR R4 resources:
- Patient → Demographics from intake forms
- Observation → Vital signs measurements
The resulting FHIR Bundle can be POSTed to any FHIR-compliant system.
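Submitting the finished bundle is a single transaction POST to the server root. A minimal sketch, where the base URL and bearer token are placeholders for your own FHIR server and auth flow:
# A hedged sketch: POST a transaction Bundle to a FHIR R4 endpoint.
# The URL and token are placeholders; real EHRs each have their own auth flows.
import requests

def post_bundle(bundle: dict, base_url: str = "https://fhir.example.com/r4",
                token: str = "YOUR_TOKEN") -> requests.Response:
    response = requests.post(
        base_url,  # transaction bundles are POSTed to the server base URL
        json=bundle,
        headers={
            "Content-Type": "application/fhir+json",
            "Authorization": f"Bearer {token}",
        },
        timeout=30,
    )
    response.raise_for_status()  # surface 4xx/5xx errors from the server
    return response
Once fhir_bundle is assembled below, post_bundle(fhir_bundle) would submit it.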
# FHIR R4 Resource Generation from extracted OCR data
def generate_fhir_patient(demographics: Dict[str, Any]) -> Dict[str, Any]:
"""Convert extracted demographics to FHIR R4 Patient resource"""
# Parse name (assumes "Last, First" or "First Last" format)
name_parts = demographics.get("patient_name", "Unknown").replace(",", " ").split()
patient = {
"resourceType": "Patient",
"id": str(uuid.uuid4()),
"meta": {
"profile": ["http://hl7.org/fhir/us/core/StructureDefinition/us-core-patient"]
},
"identifier": [{
"system": "urn:oid:2.16.840.1.113883.4.1", # Example OID
"value": demographics.get("insurance_id", "UNKNOWN")
}],
"name": [{
"use": "official",
"family": name_parts[0] if name_parts else "Unknown",
"given": name_parts[1:] if len(name_parts) > 1 else []
}],
"birthDate": convert_date_to_fhir(demographics.get("date_of_birth")),
"gender": map_gender(demographics.get("gender")),
}
# Add address if present
if demographics.get("address"):
patient["address"] = [{
"use": "home",
"text": demographics["address"]
}]
# Add phone if present
if demographics.get("phone"):
patient["telecom"] = [{
"system": "phone",
"value": demographics["phone"],
"use": "home"
}]
return patient
def generate_fhir_vitals(vitals: Dict[str, Any], patient_id: str) -> List[Dict[str, Any]]:
"""Convert extracted vitals to FHIR R4 Observation resources"""
observations = []
# LOINC codes for common vitals
vital_mappings = {
"blood_pressure": {"code": "85354-9", "display": "Blood pressure panel"},
"heart_rate": {"code": "8867-4", "display": "Heart rate", "unit": "/min"},
"temperature": {"code": "8310-5", "display": "Body temperature", "unit": "Cel"},
"respiratory_rate": {"code": "9279-1", "display": "Respiratory rate", "unit": "/min"},
"oxygen_saturation": {"code": "2708-6", "display": "Oxygen saturation", "unit": "%"},
"weight": {"code": "29463-7", "display": "Body weight", "unit": "kg"},
"height": {"code": "8302-2", "display": "Body height", "unit": "cm"}
}
effective_date = convert_date_to_fhir(vitals.get("date_recorded")) or datetime.now().strftime("%Y-%m-%d")
for vital_key, loinc in vital_mappings.items():
value = vitals.get(vital_key)
if value:
observation = {
"resourceType": "Observation",
"id": str(uuid.uuid4()),
"status": "final",
"category": [{
"coding": [{
"system": "http://terminology.hl7.org/CodeSystem/observation-category",
"code": "vital-signs",
"display": "Vital Signs"
}]
}],
"code": {
"coding": [{
"system": "http://loinc.org",
"code": loinc["code"],
"display": loinc["display"]
}]
},
"subject": {"reference": f"Patient/{patient_id}"},
"effectiveDateTime": effective_date,
"valueString": str(value) # Using string for flexibility; production would parse numeric
}
observations.append(observation)
return observations
# Helper functions
def convert_date_to_fhir(date_str: Optional[str]) -> Optional[str]:
"""Convert various date formats to FHIR format (YYYY-MM-DD)"""
if not date_str:
return None
# Handle MM/DD/YYYY format
try:
parts = date_str.replace("-", "/").split("/")
if len(parts) == 3:
if len(parts[0]) == 4: # Already YYYY-MM-DD
return date_str
return f"{parts[2]}-{parts[0].zfill(2)}-{parts[1].zfill(2)}"
except (ValueError, IndexError):
pass
return date_str
def map_gender(gender_str: Optional[str]) -> str:
"""Map various gender representations to FHIR values"""
if not gender_str:
return "unknown"
g = gender_str.lower().strip()
if g in ["m", "male"]:
return "male"
elif g in ["f", "female"]:
return "female"
return "unknown"
print("✅ FHIR resource generators ready")def create_fhir_bundle_from_batch(batch_results: List[PageResult]) -> Dict[str, Any]:
"""
Create a FHIR Bundle from batch-processed OCR results.
Args:
batch_results: Results from process_patient_packet()
Returns:
FHIR R4 Bundle resource ready for EHR integration
"""
bundle = {
"resourceType": "Bundle",
"id": str(uuid.uuid4()),
"type": "transaction",
"timestamp": datetime.now().isoformat(),
"entry": []
}
patient_id = None
for result in batch_results:
if result.status != "success" or not result.extracted_data:
continue
doc_type = result.document_type
data = result.extracted_data
# Generate Patient resource from demographics
if doc_type == "patient_demographics":
patient_resource = generate_fhir_patient(data)
patient_id = patient_resource["id"]
bundle["entry"].append({
"fullUrl": f"urn:uuid:{patient_id}",
"resource": patient_resource,
"request": {
"method": "POST",
"url": "Patient"
}
})
# Generate Observations from vitals
elif doc_type == "vitals_flowsheet" and patient_id:
observations = generate_fhir_vitals(data, patient_id)
for obs in observations:
bundle["entry"].append({
"fullUrl": f"urn:uuid:{obs['id']}",
"resource": obs,
"request": {
"method": "POST",
"url": "Observation"
}
})
return bundle
# Generate FHIR Bundle from our batch results
fhir_bundle = create_fhir_bundle_from_batch(batch_results)
print("🏥 FHIR BUNDLE GENERATED")
print("=" * 50)
print(f"Bundle ID: {fhir_bundle['id']}")
print(f"Bundle Type: {fhir_bundle['type']}")
print(f"Total Resources: {len(fhir_bundle['entry'])}")
print(f"\nResources by type:")
resource_types = Counter(e["resource"]["resourceType"] for e in fhir_bundle["entry"])
for rtype, count in resource_types.items():
print(f" • {rtype}: {count}")
# Display the full FHIR Bundle (ready to POST to an EHR)
print("📄 FHIR BUNDLE JSON (Ready for EHR Integration)")
print("=" * 50)
print(json.dumps(fhir_bundle, indent=2))
4. Summary & Next Steps
This cookbook demonstrated a complete production pipeline for healthcare OCR:
| Pattern | Purpose | Key Benefit |
|---|---|---|
| Classification | Identify document types | Route to correct extraction schema |
| Batch Processing | Handle multi-page packets | Scale with rate limiting & error isolation |
| FHIR Generation | Convert to healthcare standards | Integrate with Epic, Cerner, any FHIR EHR |
Production enhancements to consider:
- Confidence thresholds → Flag low-confidence extractions for human review (see the sketch below)
- Async processing → Use asyncio for higher throughput
- Audit logging → Track PHI access for HIPAA compliance
- FHIR validation → Validate bundles against US Core profiles before submission
- Webhook notifications → Alert downstream systems when processing completes
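For the first of these, a minimal confidence gate might look like the following sketch; the threshold is a hypothetical value to tune against your own error data:
# A minimal sketch of a confidence gate: pages under the threshold are routed
# to a human-review queue instead of flowing straight into the FHIR bundle.
REVIEW_THRESHOLD = 0.85  # hypothetical cutoff; tune against your own error data

def needs_human_review(result: PageResult) -> bool:
    confidence = result.classification.get("confidence", 0.0)
    return result.status != "success" or confidence < REVIEW_THRESHOLD

review_queue = [r for r in batch_results if needs_human_review(r)]
auto_approved = [r for r in batch_results if not needs_human_review(r)]
print(f"Auto-approved: {len(auto_approved)} page(s) | Flagged for review: {len(review_queue)} page(s)")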