Product Datasheet Analysis using Document AI

Overview

This cookbook demonstrates automated product datasheet analysis using Mistral AI's Document AI.

Use Case: Battery Procurement & Vendor Validation

You're sourcing lithium-ion batteries for a portable device. Vendors send PDF datasheets with hundreds of specifications. Manually comparing each against your design requirements is time-consuming and error-prone.

This cookbook automates the process:

  1. Extract structured data from lithium battery PDF datasheets using Document AI - Mistral OCR with Document Annotations
  2. Compare specifications against design requirements
  3. Generate detailed technical reports with comprehensive analysis for each parameter

Input Files Required:

  1. 📄 Product Datasheet PDF (lithium_iron_cell_datasheet.pdf)

    • Vendor-provided specification document containing technical specs, safety info, and performance data
  2. 📋 Design Requirements (battery_requirements.txt)

    • Your project's specification criteria defining acceptable ranges for capacity, voltage, temperature, safety, etc.

Technology Stack:

  • Mistral OCR (mistral-ocr-latest) - PDF parsing with document annotations
  • Mistral Medium (mistral-medium-latest) - Technical report generation

Key Features:

  • Document AI for OCR + structured extraction
  • Direct Pydantic schema extraction
  • Comprehensive battery specification coverage
  • Safety-focused validation
  • Professional technical report generation

Benefits: The workflow is fast and accurate, and it generates professional documentation for procurement decisions.

1. Setup and Imports

# Install required packages (uncomment if needed)
# !pip install mistralai
import base64
import os
import json
from mistralai import Mistral
from mistralai.extra import response_format_from_pydantic_model
from pydantic import BaseModel, Field
from typing import List, Optional

print("✓ All imports successful")
# Initialize Mistral client
api_key = os.getenv("MISTRAL_API_KEY")
if not api_key:
    raise ValueError("MISTRAL_API_KEY environment variable not set. Please set it before running.")

client = Mistral(api_key=api_key)
print("✓ Mistral client initialized")

2. Define Data Schemas

We define comprehensive Pydantic schemas for lithium battery specifications including capacity, voltage, current, temperature, dimensions, and safety features.

# Schema for lithium battery specifications
class CapacitySpec(BaseModel):
    """Battery capacity specifications."""
    normal_capacity: float = Field(..., description="Normal capacity in mAh")
    minimum_capacity: float = Field(..., description="Minimum capacity in mAh")
    unit: str = Field("mAh", description="Capacity unit")

class VoltageSpec(BaseModel):
    """Voltage specifications."""
    nominal_voltage: float = Field(..., description="Nominal voltage in Volts")
    charge_voltage: float = Field(..., description="Charge voltage in Volts")
    discharge_cutoff_voltage: float = Field(..., description="Discharge cut-off voltage in Volts")

class CurrentSpec(BaseModel):
    """Current specifications."""
    standard_charge_current: float = Field(..., description="Standard charge current in mA")
    maximum_charge_current: float = Field(..., description="Maximum charge current in mA")
    standard_discharge_current: float = Field(..., description="Standard discharge current in mA")
    maximum_discharge_current: float = Field(..., description="Maximum discharge current in mA")
    max_instantaneous_discharge: float = Field(..., description="Maximum instantaneous discharge current in mA")

class TemperatureRange(BaseModel):
    """Temperature range specifications."""
    min_temp: float = Field(..., description="Minimum temperature in °C")
    max_temp: float = Field(..., description="Maximum temperature in °C")
    condition: str = Field(..., description="Condition (e.g., 'Charge', 'Discharge', 'Storage')")

class DimensionsSpec(BaseModel):
    """Physical dimensions specifications."""
    height: float = Field(..., description="Cell height in mm")
    diameter: float = Field(..., description="Diameter in mm")
    weight: float = Field(..., description="Weight in grams")

class PerformanceSpec(BaseModel):
    """Performance test results."""
    test_name: str = Field(..., description="Name of the performance test")
    criteria: str = Field(..., description="Performance criteria/requirement")
    result: str = Field(..., description="Test result status")

class LithiumBatterySpec(BaseModel):
    """Complete specification for a lithium battery cell."""
    model_name: str = Field(..., description="Model name or number")
    product_type: str = Field(..., description="Product type (e.g., 'Lithium-ion Cell Battery')")
    capacity: CapacitySpec = Field(..., description="Capacity specifications")
    voltage: VoltageSpec = Field(..., description="Voltage specifications")
    current: CurrentSpec = Field(..., description="Current specifications")
    internal_impedance: str = Field(..., description="Internal impedance specification")
    dimensions: DimensionsSpec = Field(..., description="Physical dimensions")
    cycle_life: int = Field(..., description="Cycle life (number of cycles)")
    operating_temperatures: List[TemperatureRange] = Field(..., description="Operating temperature ranges")
    storage_temperatures: List[TemperatureRange] = Field(..., description="Storage temperature ranges")
    performance_tests: List[PerformanceSpec] = Field(default=[], description="Performance test results")
    certifications: List[str] = Field(default=[], description="Certifications and standards")
    manufacturer: str = Field(..., description="Manufacturer company name")
    distributor: str = Field(..., description="Distributor/vendor information")
    warnings: List[str] = Field(default=[], description="Key safety warnings and precautions")

class LithiumBatterySchema(BaseModel):
    """Wrapper for extracted lithium battery specifications."""
    specs: List[LithiumBatterySpec] = Field(
        ..., description="List of extracted lithium battery specifications"
    )

print("✓ Pydantic schemas for lithium battery defined")

3. Helper Functions

def encode_pdf(pdf_path: str) -> str:
    """Encode PDF file to base64 string.
    
    Args:
        pdf_path: Path to the PDF file
        
    Returns:
        Base64 encoded string of the PDF
    """
    try:
        with open(pdf_path, "rb") as pdf_file:
            return base64.b64encode(pdf_file.read()).decode('utf-8')
    except FileNotFoundError:
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    except Exception as e:
        raise RuntimeError(f"Error encoding PDF: {e}") from e

print("✓ Helper functions defined")
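
Before uploading, it can help to confirm that the file really is a PDF, since a mislabeled file would otherwise only fail inside the OCR call. The sketch below is a hypothetical addition: the names `looks_like_pdf` and `to_pdf_data_url` are not part of the original cookbook. It checks for the `%PDF-` signature and builds the same `data:application/pdf;base64,...` URL that the OCR request uses later.

```python
import base64

PDF_SIGNATURE = b"%PDF-"  # PDF files begin with this marker


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF signature.

    This is a strict check; some tolerant readers accept a few leading bytes.
    """
    return data.startswith(PDF_SIGNATURE)


def to_pdf_data_url(data: bytes) -> str:
    """Build a base64 data URL for PDF bytes (hypothetical helper)."""
    if not looks_like_pdf(data):
        raise ValueError("Input does not look like a PDF (missing %PDF- signature)")
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:application/pdf;base64,{encoded}"


# Placeholder bytes for illustration; in the notebook you would read PDF_PATH in binary mode.
sample = b"%PDF-1.7\n% placeholder bytes, not a real document\n"
url = to_pdf_data_url(sample)
print(url[:40])
```

The check only inspects the first five bytes, so it catches renamed files but does not validate the rest of the document.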

4. File Setup

Verify that the required files exist.

# Define file paths
PDF_PATH = "lithium_iron_cell_datasheet.pdf"
REQUIREMENTS_PATH = "battery_requirements.txt"

# Verify files exist
if os.path.exists(PDF_PATH):
    print(f"✓ Found PDF: {PDF_PATH}")
else:
    raise FileNotFoundError(f"❌ PDF not found: {PDF_PATH}")
    
if os.path.exists(REQUIREMENTS_PATH):
    print(f"✓ Found requirements: {REQUIREMENTS_PATH}")
else:
    raise FileNotFoundError(f"❌ Requirements not found: {REQUIREMENTS_PATH}")

5. Extract Structured Data with Document Annotations

This is the key feature of this cookbook. We use Mistral OCR's document_annotation_format parameter to extract structured battery specifications directly from the PDF in a single API call.

How it works:

  1. The PDF is encoded to base64
  2. Mistral OCR processes the document
  3. The document_annotation_format parameter tells the OCR to extract data matching our comprehensive battery schema
  4. We get back structured data including capacity, voltage, current, temperatures, dimensions, and safety specs

Benefits:

  • ✅ Single API call (no separate LLM call needed)
  • ✅ Direct schema extraction during OCR
  • ✅ More accurate (extraction happens with full document context)
  • ✅ Captures complex nested specifications
  • ✅ Captures safety-critical fields (warnings, certifications, temperature limits)

Note:

Document annotations are limited to 8 pages. For larger documents, split them into chunks.
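
One way to split a longer document is to group its page indices into runs of at most 8 and send one OCR request per group. The helper below is a minimal sketch: `chunk_pages` is a hypothetical name, and the 20-page count is a placeholder. It uses zero-based indices, matching the `list(range(8))` call in the next cell.

```python
from typing import List

MAX_ANNOTATION_PAGES = 8  # page limit for document annotations stated above


def chunk_pages(total_pages: int, chunk_size: int = MAX_ANNOTATION_PAGES) -> List[List[int]]:
    """Split zero-based page indices into consecutive chunks of at most chunk_size."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        list(range(start, min(start + chunk_size, total_pages)))
        for start in range(0, total_pages, chunk_size)
    ]


# Placeholder page count; each printed list could be passed as the `pages` argument.
for pages in chunk_pages(20):
    print(pages)
```

For a 20-page document this yields three groups: pages 0-7, 8-15, and 16-19.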

print("📄 Extracting structured data from battery datasheet...")
print(f"   Processing: {PDF_PATH}")

# Encode PDF to base64
base64_pdf = encode_pdf(PDF_PATH)
print("   ✓ PDF encoded to base64")

# Extract structured data using Mistral OCR with document annotations
print("   🔍 Running Mistral OCR with document annotations...")
annotations_response = client.ocr.process(
    model="mistral-ocr-latest",
    pages=list(range(8)),  # Document Annotations limited to 8 pages
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{base64_pdf}"
    },
    document_annotation_format=response_format_from_pydantic_model(LithiumBatterySchema),
    include_image_base64=True
)

print(f"   ✓ OCR completed - {len(annotations_response.pages)} pages processed")
print("   ✓ Structured data extracted successfully")
# Parse the extracted data into our Pydantic model
extracted_data = LithiumBatterySchema(**json.loads(annotations_response.document_annotation))

print("\n" + "="*60)
print("🔋 EXTRACTED BATTERY SPECIFICATIONS")
print("="*60)
print(json.dumps(extracted_data.model_dump(), indent=2))
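
Schema validation confirms types, but not that the extracted numbers make physical sense together. The sketch below shows one possible internal-consistency pass over a single spec dictionary, shaped like an entry of `extracted_data.model_dump()["specs"]`. The function name `sanity_check_spec` and the specific rules are assumptions, not part of the original workflow, and the example values are placeholders rather than data from a real datasheet.

```python
from typing import Any, Dict, List


def sanity_check_spec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of internal-consistency problems found in one spec dict."""
    problems: List[str] = []

    capacity = spec["capacity"]
    if capacity["minimum_capacity"] > capacity["normal_capacity"]:
        problems.append("minimum_capacity exceeds normal_capacity")

    voltage = spec["voltage"]
    # Assumed ordering for a rechargeable cell: cut-off < nominal < charge voltage.
    if not (voltage["discharge_cutoff_voltage"] < voltage["nominal_voltage"] < voltage["charge_voltage"]):
        problems.append("voltages are not ordered cutoff < nominal < charge")

    current = spec["current"]
    if current["standard_charge_current"] > current["maximum_charge_current"]:
        problems.append("standard_charge_current exceeds maximum_charge_current")
    if current["standard_discharge_current"] > current["maximum_discharge_current"]:
        problems.append("standard_discharge_current exceeds maximum_discharge_current")

    for temp in spec["operating_temperatures"] + spec["storage_temperatures"]:
        if temp["min_temp"] >= temp["max_temp"]:
            problems.append(f"{temp['condition']}: min_temp is not below max_temp")

    return problems


# Placeholder values for illustration only; not taken from any real datasheet.
example_spec = {
    "capacity": {"normal_capacity": 2000.0, "minimum_capacity": 2100.0},
    "voltage": {"nominal_voltage": 3.6, "charge_voltage": 4.2, "discharge_cutoff_voltage": 3.0},
    "current": {
        "standard_charge_current": 500.0, "maximum_charge_current": 1000.0,
        "standard_discharge_current": 500.0, "maximum_discharge_current": 2000.0,
    },
    "operating_temperatures": [{"min_temp": 0.0, "max_temp": 45.0, "condition": "Charge"}],
    "storage_temperatures": [{"min_temp": -20.0, "max_temp": 60.0, "condition": "Storage"}],
}
print(sanity_check_spec(example_spec))  # ['minimum_capacity exceeds normal_capacity']
```

A non-empty result is a cue to re-check the datasheet by hand before relying on the extracted values.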

6. Generate Comparison Report

Now that we have structured battery data, we use a Mistral LLM (mistral-medium-latest) to compare it against the design requirements and generate a detailed safety and performance report.

# Load design requirements
print("📋 Loading battery design requirements...")
with open(REQUIREMENTS_PATH, 'r', encoding='utf-8') as f:
    requirements = f.read()
print(f"   ✓ Requirements loaded from {REQUIREMENTS_PATH}")
print("\nDesign Requirements:")
print(requirements)
print("\n📊 Generating detailed technical report with Mistral LLM...")

# Prepare the comparison prompt for narrative report generation
comparison_prompt = f"""You are an expert battery engineer specializing in lithium-ion battery safety, performance validation, and technical documentation.

I need you to write a comprehensive technical evaluation report comparing a lithium battery's specifications against design requirements.

Design Requirements:
{requirements}

Extracted Battery Specifications:
{json.dumps(extracted_data.model_dump(), indent=2)}

Please write a detailed technical report with the following sections:

# BATTERY VALIDATION REPORT

## 1. EXECUTIVE SUMMARY
Provide a 2-3 paragraph summary of the battery model, manufacturer, and overall compliance status. Include the final recommendation (APPROVED/REJECTED/CONDITIONAL APPROVAL).

## 2. BATTERY IDENTIFICATION
- Model Number
- Manufacturer
- Product Type
- Distributor

## 3. SPECIFICATION ANALYSIS

### 3.1 Capacity Analysis
Compare the normal and minimum capacity against requirements. Explain if it meets or fails the criteria with actual values.

### 3.2 Voltage Characteristics
Analyze nominal voltage, charge voltage, and discharge cut-off voltage. Discuss compliance with safety margins.

### 3.3 Current Capabilities
Evaluate standard and maximum charge/discharge currents. Discuss whether the battery can handle the required load profiles.

### 3.4 Physical Specifications
Verify dimensional compliance (diameter, height, weight) for 18650 standard format.

### 3.5 Performance Characteristics
Assess cycle life and internal impedance against requirements. Discuss implications for product lifetime.

### 3.6 Operating Conditions
Evaluate temperature ranges for charging, discharging, and storage. Identify any limitations or concerns.

## 4. SAFETY EVALUATION
Review safety certifications, protection features (over-charge, over-discharge, short-circuit), and compliance with standards (UN38.3, IEC62133).

## 5. QUALITY ASSESSMENT
Evaluate manufacturing facility certification and performance test results.

## 6. RISK ASSESSMENT
Identify any specification gaps, safety concerns, or operational limitations. Discuss potential risks and mitigation strategies.

## 7. FINAL RECOMMENDATION
Provide clear recommendation: APPROVED, REJECTED, or CONDITIONAL APPROVAL with specific conditions.

Write the report in professional technical language suitable for engineering documentation and procurement decisions. Be thorough, objective, and include specific values and comparisons throughout."""

# Generate narrative report using Mistral LLM (NO response_format - free text)
comparison_response = client.chat.complete(
    model="mistral-medium-latest",
    messages=[
        {"role": "user", "content": comparison_prompt}
    ],
    temperature=0.3  # Slightly higher for more natural writing
)

# Extract the narrative report
narrative_report = comparison_response.choices[0].message.content

print("   ✓ Technical report generated successfully")
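
Because the report is free text, downstream tooling may want the final verdict as a machine-readable value. The sketch below is one possible approach, assuming the model follows the prompt and writes one of the three uppercase labels after a "FINAL RECOMMENDATION" heading. `extract_recommendation` is a hypothetical helper and the sample report is placeholder text.

```python
import re
from typing import Optional

# Longest label first so "CONDITIONAL APPROVAL" is matched as a whole.
_VERDICT_PATTERN = re.compile(r"\b(CONDITIONAL APPROVAL|APPROVED|REJECTED)\b")
_SECTION_PATTERN = re.compile(r"FINAL RECOMMENDATION", re.IGNORECASE)


def extract_recommendation(report: str) -> Optional[str]:
    """Return the first verdict after the last 'FINAL RECOMMENDATION' heading, if any.

    Falls back to searching the whole report when the heading is missing.
    """
    headings = list(_SECTION_PATTERN.finditer(report))
    section = report[headings[-1].end():] if headings else report
    match = _VERDICT_PATTERN.search(section)
    return match.group(1) if match else None


# Placeholder report text for illustration.
sample_report = """# BATTERY VALIDATION REPORT

## 1. EXECUTIVE SUMMARY
Placeholder summary text.

## 7. FINAL RECOMMENDATION
**CONDITIONAL APPROVAL** pending confirmation of cycle-life test data.
"""
print(extract_recommendation(sample_report))  # CONDITIONAL APPROVAL
```

A simple pattern match like this cannot detect negations such as "not APPROVED", so a `None` result or an unexpected label should prompt a manual read of the report.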

7. Display Results

print("\n" + "="*80)
print("📋 BATTERY TECHNICAL EVALUATION REPORT")
print("="*80)
print(narrative_report)
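
For numeric limits, a deterministic check can serve as a cross-check on the narrative report. The sketch below assumes the requirements have been transcribed by hand into minimum/maximum bounds. The `REQUIREMENT_LIMITS` table, the helper names, and all values are hypothetical placeholders, since the format of battery_requirements.txt is not specified here.

```python
from typing import Dict, List, Optional, Tuple

Limits = Dict[str, Tuple[Optional[float], Optional[float]]]

# Hypothetical limits for illustration; real ones would come from battery_requirements.txt.
# Each entry maps a dotted field path to (minimum, maximum); None means unbounded.
REQUIREMENT_LIMITS: Limits = {
    "capacity.minimum_capacity": (1900.0, None),
    "voltage.charge_voltage": (None, 4.2),
    "dimensions.weight": (None, 50.0),
}


def get_path(data: dict, dotted_path: str) -> float:
    """Follow a dotted path such as 'capacity.minimum_capacity' through nested dicts."""
    value = data
    for key in dotted_path.split("."):
        value = value[key]
    return value


def check_limits(spec: dict, limits: Limits) -> List[str]:
    """Return one PASS/FAIL line per limit."""
    lines = []
    for path, (low, high) in limits.items():
        value = get_path(spec, path)
        ok = (low is None or value >= low) and (high is None or value <= high)
        lines.append(f"{'PASS' if ok else 'FAIL'} {path}={value} (min={low}, max={high})")
    return lines


# Placeholder spec values, not taken from any real datasheet.
example_spec = {
    "capacity": {"minimum_capacity": 1950.0},
    "voltage": {"charge_voltage": 4.25},
    "dimensions": {"weight": 45.0},
}
for line in check_limits(example_spec, REQUIREMENT_LIMITS):
    print(line)
```

With these placeholder values, the charge voltage line fails and the other two pass. Any disagreement between such a check and the narrative report is worth investigating.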

8. Export Results

Save the complete battery analysis to a JSON file and the narrative report to a Markdown file for future reference and compliance records.

# Export complete results including narrative report
output_json = "battery_analysis_results.json"
output_report = "battery_technical_report.md"

# Save JSON results
results = {
    "extracted_data": extracted_data.model_dump(),
    "narrative_report": narrative_report,
    "requirements": requirements
}

with open(output_json, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2)

# Save narrative report as markdown
with open(output_report, 'w', encoding='utf-8') as f:
    f.write(narrative_report)

print(f"\n💾 Complete analysis saved to: {output_json}")
print(f"📄 Technical report saved to: {output_report}")
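
If the exported files are kept as compliance records, it may be useful to store which input and which models produced them. The sketch below builds a small provenance record that could be added to the `results` dictionary. `build_provenance` and its field names are hypothetical, and the bytes are a placeholder for the real datasheet.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_provenance(pdf_bytes: bytes, ocr_model: str, report_model: str) -> dict:
    """Describe which input and models produced an analysis (hypothetical helper)."""
    return {
        "source_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "ocr_model": ocr_model,
        "report_model": report_model,
        "generated_at_utc": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder bytes stand in for the datasheet; in the notebook, read PDF_PATH in binary mode.
provenance = build_provenance(
    b"%PDF-1.7 placeholder", "mistral-ocr-latest", "mistral-medium-latest"
)
print(json.dumps(provenance, indent=2))
```

The hash lets a later reviewer confirm that a stored report refers to the exact PDF that was analyzed.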

Conclusion

What We Built:

This cookbook demonstrated an end-to-end workflow for analyzing lithium battery datasheets using only Mistral AI models:

  1. Comprehensive Data Extraction - Capacity, voltage, current, temperature, dimensions, safety specs
  2. Safety-Focused Validation - Protection features, certifications, operating limits
  3. Automated Compliance - Compare against industry standards and design requirements
  4. Detailed Reporting - Pass/fail analysis for each specification category

Key Advantages:

  • Document AI for extraction (no separate OCR and LLM extraction steps needed)
  • Comprehensive schema covering all critical battery specifications
  • Safety-critical validation for charge/discharge limits and temperature ranges

Technology Stack:

  • Mistral OCR (mistral-ocr-latest) with document annotations
  • Mistral Medium (mistral-medium-latest) for comparison reports
  • Pydantic for comprehensive schema validation

Use Cases:

This workflow is well suited to:

  • Battery procurement - Validate vendor specifications
  • Quality control - Ensure compliance with design requirements
  • Safety validation - Check protection features and operating limits
  • Product development - Compare multiple battery options
  • Compliance reporting - Generate validation records

Extending This Cookbook:

You can easily adapt this workflow for:

  • Other electronic components (capacitors, resistors, ICs)
  • Different battery chemistries (LiFePO4, NiMH, etc.)
  • Resume matching against job descriptions

Limitations:

  • Document annotations are limited to 8 pages
  • For larger documents, split them into chunks and process separately
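
When a document is processed in chunks, each chunk returns its own annotation JSON, and the results need to be combined afterwards. The sketch below concatenates the `specs` lists from several annotation strings, assuming each one follows the `LithiumBatterySchema` wrapper shape. `merge_chunk_annotations` is a hypothetical helper and the payloads are placeholders.

```python
import json
from typing import Dict, List


def merge_chunk_annotations(chunk_annotations: List[str]) -> Dict[str, list]:
    """Concatenate the 'specs' lists from several per-chunk annotation JSON strings."""
    merged: Dict[str, list] = {"specs": []}
    for raw in chunk_annotations:
        payload = json.loads(raw)
        merged["specs"].extend(payload.get("specs", []))
    return merged


# Placeholder payloads; real ones would come from each chunk's document_annotation.
chunks = [
    '{"specs": [{"model_name": "CELL-A"}]}',
    '{"specs": []}',
    '{"specs": [{"model_name": "CELL-B"}]}',
]
merged = merge_chunk_annotations(chunks)
print([spec["model_name"] for spec in merged["specs"]])  # ['CELL-A', 'CELL-B']
```

This simple concatenation assumes each cell's specification falls entirely within one chunk. A specification that spans a chunk boundary would need field-level merging instead.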