Offline Transcription

Before You Start

Models with Audio Capabilities

Audio-capable models:

  • Voxtral Small (voxtral-small-latest), with audio input for chat use cases.
  • Voxtral Mini (voxtral-mini-latest), with audio input for chat use cases.
  • Voxtral Mini Transcribe (voxtral-mini-latest via audio/transcriptions), an efficient transcription-only service.
tip

For faster transcription, we recommend uploading your audio files.

Chat with Audio

Use Audio with Instruction-Following Models

Our Voxtral models can be used for chat use cases via our chat completions endpoint.

tip

Before continuing, we recommend reading the Chat Completions documentation to learn more about the chat completions API and how to use it.

To pass a local audio file, you can encode it in base64 and pass it as a string.

import base64
import os

from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "voxtral-mini-latest"

client = Mistral(api_key=api_key)

# Encode the audio file in base64
with open("examples/files/bcn_weather.mp3", "rb") as f:
    content = f.read()
audio_base64 = base64.b64encode(content).decode('utf-8')

chat_response = client.chat.complete(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": audio_base64,
            },
            {
                "type": "text",
                "text": "What's in this file?"
            },
        ]
    }],
)
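The base64 step above can be factored into a small reusable helper (the file path in the example is illustrative):

```python
import base64


def encode_audio(path: str) -> str:
    """Read an audio file and return its contents as a base64-encoded UTF-8 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

Note that base64 inflates the payload by roughly 33%, which is one reason uploading large files is faster than passing them inline.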
Example Samples

Below are a few of the many use cases made possible by the audio capabilities of our models.

User

(audio clip)

Assistant

Sure, here is the transcription of what you said:
"Today on July 15th, 2025, Mistral AI is announcing the new Voxtral models."
And here is the Python code to generate the first n digits of the Fibonacci sequence:

def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]

    fib_sequence = [0, 1]
    for i in range(2, n):
        next_value = fib_sequence[-1] + fib_sequence[-2]
        fib_sequence.append(next_value)

    return fib_sequence

# Example usage:
n = 10  # Change this value to get the first n Fibonacci numbers
print(fibonacci(n))

This code defines a function fibonacci that takes an integer n and returns a list containing the first n Fibonacci numbers. You can change the value of n to get the desired number of terms.

Transcription

Transcribe any Audio

Transcription is an endpoint optimized for transcription purposes; it currently supports voxtral-mini-latest, which runs Voxtral Mini Transcribe.

Parameters
We provide different settings and parameters for transcription, such as:

  • timestamp_granularities: This allows you to set timestamps to track not only "what" was said but also "when". You can find more about timestamps here.
  • diarize: This allows you to keep track of who is talking.
  • context_bias: Provide up to 100 words or phrases to guide the model toward correct spellings of names, technical terms, or domain-specific vocabulary. Particularly useful for proper nouns or industry terminology that standard models often miss. Context biasing is optimized for English; support for other languages is experimental. You can find more about context biasing here.
  • language: Our transcription service also works as a language detection service. However, you can manually set the language of the transcription for better accuracy if the language of the audio is already known.

Realtime: We provide a live transcription functionality. You can find more info about Realtime here.

Among the different ways to pass audio, you can provide the path to a local file to upload and transcribe it as follows:

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "voxtral-mini-latest"

client = Mistral(api_key=api_key)

with open("/path/to/file/audio.mp3", "rb") as f:
    transcription_response = client.audio.transcriptions.complete(
        model=model,
        file={
            "content": f,
            "file_name": "audio.mp3",
        },
        # language="en"
    )
Example Samples

Below you can find a few examples leveraging the audio transcription endpoint.

Audio

(audio clip)

Transcription

This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms, in schools, at farms and on factory floors, at diners and on distant military outposts, All these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better President, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster, and cities like Boston show the world that no terrorist will ever break the American spirit. I've seen the hopeful faces of young graduates and our newest military officers. I've mourned with grieving families searching for answers. And I found grace in a Charleston church. I've seen our scientists help a paralyzed man regain his sense of touch, and our wounded warriors walk again. I've seen our doctors and volunteers rebuild after earthquakes and stop pandemics in their tracks. I've learned from students who are building robots and curing diseases, and who will change the world in ways we can't even imagine. I've seen the youngest of children remind us of our obligations to care for our refugees. to work in peace, and above all, to look out for each other. That's what's possible when we come together in the slow, hard, sometimes frustrating, but always vital work of self-government. But we can't take our democracy for granted. All of us, regardless of party, should throw ourselves into the work of citizenship. 
Not just when there is an election. Not just when our own narrow interest is at stake. But over the full span of a lifetime. If you're tired of arguing with strangers on the Internet, try to talk with one in real life. If something needs fixing, lace up your shoes and do some organizing. If you're disappointed by your elected officials, then grab a clipboard, get some signatures, and run for office yourself. Our success depends on our participation, regardless of which way the pendulum of power swings. It falls on each of us to be guardians of our democracy. to embrace the joyous task we've been given to continually try to improve this great nation of ours. Because for all our outward differences, we all share the same proud title – citizen. It has been the honor of my life to serve you as President. Eight years later, I am even more optimistic about our country's promise. And I look forward to working along your side as a citizen for all my days that remain. Thanks, everybody. God bless you. And God bless the United States of America.

Language

English

Transcription with Timestamps

You can request timestamps for the transcription by passing the timestamp_granularities parameter, which currently supports segment and word.
It returns the start and end time of each segment (or word) in the audio file.

note

timestamp_granularities is currently not compatible with language; use one or the other.

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "voxtral-mini-latest"

client = Mistral(api_key=api_key)

transcription_response = client.audio.transcriptions.complete(
    model=model,
    file_url="https://docs.mistral.ai/audio/obama.mp3",
    timestamp_granularities=["segment"]  # or ["word"]
)
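Timestamps are returned in seconds. If you want human-readable offsets, e.g. for subtitles, a small helper can render them in SRT-style HH:MM:SS,mmm form (the usage sketch assumes the response exposes per-segment start, end, and text attributes, which is an assumption about the response shape):

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as SRT-style HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


# Hypothetical usage over a transcription response's segments:
# for seg in transcription_response.segments:
#     print(f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}  {seg.text}")
```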
Context biasing

Provide up to 100 words or phrases to guide the model toward correct spellings of names, technical terms, or domain-specific vocabulary. Particularly useful for proper nouns or industry terminology that standard models often miss. Context biasing is optimized for English; support for other languages is experimental.

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "voxtral-mini-2602"

client = Mistral(api_key=api_key)

transcription_response = client.audio.transcriptions.complete(
    model=model,
    file_url="https://docs.mistral.ai/audio/obama.mp3",
    context_bias="Chicago,Joplin,Boston,Charleston,farewell_address,self-government,citizenship,democracy,American_people,cancer_survivors,affordable_health_care,wounded_warriors,refugees,elected_officials,American_spirit,work_of_citizenship,guardians_of_our_democracy"
)
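Since context_bias is capped at 100 words or phrases, a small helper can validate the list and join it into the comma-separated form used above, with spaces in multi-word phrases replaced by underscores. Note that this exact string format is inferred from the example, not from a published specification:

```python
def build_context_bias(terms: list[str], limit: int = 100) -> str:
    """Join bias terms into a comma-separated string, underscoring spaces in phrases."""
    if len(terms) > limit:
        raise ValueError(f"context_bias accepts at most {limit} terms, got {len(terms)}")
    return ",".join(term.strip().replace(" ", "_") for term in terms)
```

For example, `build_context_bias(["Mistral AI", "Voxtral"])` yields `"Mistral_AI,Voxtral"`.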
FAQ
