Audio

Transcribe speech to text, generate speech from text, clone voices from short samples, and build real-time voice agents with Mistral's Voxtral models.

Start with the outcome you want to build. Use the request-based audio APIs for files and bounded requests, the realtime API for live audio that needs low latency, and the full speech-to-speech pipeline for voice agents that listen, reason, and respond.

Choose your path

Goal	Models	How to use
Transcribe recordings, meetings, or call archives	Voxtral Mini Transcribe 2	Offline transcription
Transcribe live audio with sub-200ms latency	Voxtral Realtime	Realtime transcription
Generate natural speech or clone a voice	Voxtral TTS	Text to speech
Build a voice agent that listens, reasons, and speaks	Voxtral Realtime + an LLM + Voxtral TTS	Speech-to-speech pipeline

Capabilities

Speech to text

Convert speech to text with high accuracy and low latency. Two models cover batch and live workloads:

Voxtral Mini Transcribe 2 for batch transcription: speaker diarization, context biasing (up to 100 custom terms), word-level timestamps, recordings up to 3 hours per request, and noise-robust accuracy across 13 languages.
Voxtral Realtime (voxtral-mini-transcribe-realtime-2602) for live streaming: transcribes audio as it arrives, with latency configurable to below 200ms, a 4B-parameter footprint suited to edge deployment, and open weights under Apache 2.0.

Note

Realtime transcription does not support the diarize parameter. If you need speaker diarization, use batch transcription with Voxtral Mini Transcribe 2 instead.
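The constraints in this section can be checked before a request is sent. The sketch below is a minimal illustration, not part of any Mistral SDK: `TranscriptionPlan` and `validate_plan` are hypothetical names, and only the limits themselves (100 context terms, 3 hours per batch request, no diarization in realtime mode) come from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits quoted in this section; everything else here is illustrative.
MAX_CONTEXT_TERMS = 100
MAX_BATCH_SECONDS = 3 * 60 * 60  # 3 hours per batch request


@dataclass
class TranscriptionPlan:
    """Hypothetical description of a transcription job (not an SDK type)."""
    realtime: bool = False
    diarize: bool = False
    context_terms: List[str] = field(default_factory=list)
    duration_seconds: Optional[float] = None  # None when unknown, e.g. a live stream


def validate_plan(plan: TranscriptionPlan) -> List[str]:
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    if plan.realtime and plan.diarize:
        problems.append("realtime transcription cannot be combined with diarize")
    if not plan.realtime:
        # The term and duration limits are stated for the batch model only.
        if len(plan.context_terms) > MAX_CONTEXT_TERMS:
            problems.append(
                f"context biasing accepts at most {MAX_CONTEXT_TERMS} terms, "
                f"got {len(plan.context_terms)}"
            )
        if plan.duration_seconds is not None and plan.duration_seconds > MAX_BATCH_SECONDS:
            problems.append("recording exceeds the 3-hour limit per request")
    return problems


if __name__ == "__main__":
    plan = TranscriptionPlan(realtime=True, diarize=True)
    for problem in validate_plan(plan):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a job in a single pass.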

Text to speech

Voxtral TTS (voxtral-mini-tts-2603) generates expressive speech and clones any voice from a sample as short as 2–3 seconds, capturing tone, rhythm, and personality. The model treats the voice prompt as an instruction, so it follows the speaker's intonation and emotional rendering without separate prosody tags. It supports 9 languages with cross-lingual cloning and code-mixing, and streams with low processing latency (~90ms) for voice-agent use.

Voices: create and manage reusable voice profiles for consistent branding.
Speech generation: generate speech from saved voices or one-off reference clips, with basic or streaming delivery.

Speech-to-speech pipeline

Combine the models into a real-time voice-to-voice loop: Voxtral Realtime transcribes incoming speech, an LLM reasons over the transcript and decides on a response, and Voxtral TTS speaks the reply.

Each component is independently customizable and deployable. Cross-lingual voice adaptation lets the same pipeline handle live translation while preserving the speaker's accent and identity.
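The three-stage loop can be sketched with each stage as a plain callable, which is one way to keep the components independently swappable. This is an assumption-laden outline rather than Mistral's implementation: `voice_loop` and the `fake_*` stand-ins are hypothetical, the loop is turn-based rather than truly streaming, and real stages would call the transcription, chat, and speech services instead.

```python
from typing import Callable, Iterable, Iterator

# Each stage is a plain callable so it can be replaced or deployed on its own.
Transcriber = Callable[[bytes], str]   # audio for one turn -> transcript
Responder = Callable[[str], str]       # transcript -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio bytes


def voice_loop(
    audio_turns: Iterable[bytes],
    transcribe: Transcriber,
    respond: Responder,
    synthesize: Synthesizer,
) -> Iterator[bytes]:
    """Yield one synthesized reply per non-empty user turn."""
    for audio in audio_turns:
        transcript = transcribe(audio).strip()
        if not transcript:
            continue  # skip silence or empty transcripts
        reply = respond(transcript)
        yield synthesize(reply)


# Stand-in stages for a dry run (hypothetical; no network calls are made).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


replies = list(
    voice_loop([b"hello", b"  ", b"bye"], fake_transcribe, fake_respond, fake_synthesize)
)
print(replies)
```

Because `voice_loop` is a generator, each reply is produced as soon as its turn has been processed, so a caller can start playback without waiting for later turns.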

Models

The Voxtral family covers transcription, speech generation, and audio understanding, available as Premier and open-weights models. Open a model card for capabilities, languages, pricing, and benchmarks.

Voxtral Mini Transcribe 2

An efficient audio input model, pre-trained and optimized for transcription.

v26.02

Voxtral TTS

State-of-the-art TTS with zero-shot voice cloning and multilingual support.

v26.03

Voxtral Mini Transcribe Realtime

An efficient audio input model, pre-trained and optimized for live transcription.

v26.02

Voxtral Small

Our first model with audio input capabilities for instruct use cases.

v25.07

Tip

Test transcription in the speech-to-text playground, and voice generation and cloning in the text-to-speech playground.

How teams use audio

Customer support: voice agents that route and resolve queries with natural, brand-appropriate speech.
Financial services: compliant voice AI for advisory, policy queries, and client onboarding.
Compliance and risk: real-time call monitoring with speaker attribution and auditable interaction records.
Manufacturing and field operations: voice interfaces for inspection and feedback in high-noise environments.
Meetings and sales: meeting intelligence with speaker attribution and automated follow-ups.
Real-time translation: cross-lingual voice adaptation for live translation that preserves speaker identity and accent.

FAQ