Vision
Vision capabilities enable models to analyze images and provide insights based on visual content in addition to text. This multimodal approach opens up new possibilities for applications that require both textual and visual understanding.
We provide a variety of models with vision capabilities, all available via the Chat Completions API.
For more specific use cases such as Document Parsing, OCR, and Data Extraction, we recommend taking a look at our Document AI stack.
Before You Start
Models with Vision Capabilities
- Pixtral 12B via pixtral-12b-latest
- Pixtral Large via pixtral-large-latest
- Mistral Medium 3.1 via mistral-medium-2508
- Mistral Small 3.2 via mistral-small-2506
Sending an Image
There are two ways to send an image to the Chat Completions API: by passing a URL or by passing a base64-encoded image.
Before continuing, we recommend reading the Chat Completions documentation to learn more about the API and how to use it.
If the image is hosted online, you can simply provide its publicly accessible URL in the request. This method is straightforward and does not require any encoding.
import os
from mistralai import Mistral
api_key = os.environ["MISTRAL_API_KEY"]
model = "mistral-small-2506"
client = Mistral(api_key=api_key)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image?"
            },
            {
                "type": "image_url",
                "image_url": "https://docs.mistral.ai/img/eiffel-tower-paris.jpg"
            }
        ]
    }
]
chat_response = client.chat.complete(
    model=model,
    messages=messages
)
print(chat_response.choices[0].message.content)
Use cases
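The second option is to send a base64-encoded image. A minimal sketch of this approach (the helper names and file path are illustrative, not part of the SDK): read the local file, base64-encode it, and pass a data URI in place of the URL in the same message structure as above.

```python
import base64


def encode_image(image_path):
    """Read a local image file and return its base64-encoded string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_messages(image_path, question="What's in this image?"):
    """Build the same message payload as the URL example, but with a data URI."""
    base64_image = encode_image(image_path)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    # A data URI replaces the hosted URL.
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        }
    ]
```

The resulting messages can then be passed to client.chat.complete exactly as in the URL example.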
Below you can find a few examples of use cases leveraging our models' vision capabilities, from understanding graphs to extracting data.
These are simple examples you can use as inspiration to build your own use cases. For OCR and Structured Outputs, we recommend leveraging Document AI and Document AI Annotations.