Audio
Transcribe speech to text, generate speech and clone voices, and build real-time voice agents with Mistral's Voxtral models.
Start with the outcome you want to build. Use the request-based audio APIs for files and bounded requests, the realtime API for live audio that needs low latency, and the full speech-to-speech pipeline for voice agents that listen, reason, and respond.
Choose your path
| Goal | Models | How to use |
|---|---|---|
| Transcribe recordings, meetings, or call archives | Voxtral Mini Transcribe 2 | Offline transcription |
| Transcribe live audio with sub-200ms latency | Voxtral Realtime | Realtime transcription |
| Generate natural speech or clone a voice | Voxtral TTS | Text to speech |
| Build a voice agent that listens, reasons, and speaks | Voxtral Realtime + an LLM + Voxtral TTS | Speech-to-speech pipeline |
Capabilities
Speech to text
Convert speech to text with high accuracy and low latency. Two models cover batch and live workloads:
- Voxtral Mini Transcribe 2 for batch transcription: speaker diarization, context biasing (up to 100 custom terms), word-level timestamps, recordings up to 3 hours per request, and noise-robust accuracy across 13 languages.
- Voxtral Realtime (voxtral-mini-transcribe-realtime-2602) for live streaming: transcribes audio as it arrives with latency configurable down to sub-200ms, a 4B footprint for edge deployment, and open weights under Apache 2.0.
Realtime transcription is not compatible with the diarize parameter. For a given request, choose either live streaming or speaker diarization.
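The limits listed above (100 context terms, 3 hours per request, no diarization with realtime) can be checked before a request is sent. The following is a minimal sketch of such a pre-flight check. `TranscriptionPlan` and `check_plan` are hypothetical names invented for this illustration and are not part of any Mistral SDK. The sketch also assumes that the term and duration limits apply only to batch requests, because the text states them only for Voxtral Mini Transcribe 2.

```python
from dataclasses import dataclass, field
from typing import List

# Limits quoted in the text above for Voxtral Mini Transcribe 2 (batch).
MAX_CONTEXT_TERMS = 100
MAX_BATCH_SECONDS = 3 * 60 * 60  # 3 hours


@dataclass
class TranscriptionPlan:
    """Hypothetical description of a planned request; not an SDK type."""
    realtime: bool
    duration_seconds: float = 0.0
    diarize: bool = False
    context_terms: List[str] = field(default_factory=list)


def check_plan(plan: TranscriptionPlan) -> List[str]:
    """Return human-readable problems; an empty list means the plan fits."""
    problems: List[str] = []
    if plan.realtime and plan.diarize:
        problems.append("diarize cannot be combined with realtime transcription")
    if not plan.realtime:
        # Assumption: these limits are only documented for the batch model.
        if len(plan.context_terms) > MAX_CONTEXT_TERMS:
            problems.append(
                f"{len(plan.context_terms)} context terms exceeds the limit of {MAX_CONTEXT_TERMS}"
            )
        if plan.duration_seconds > MAX_BATCH_SECONDS:
            problems.append("recording exceeds the 3-hour per-request limit")
    return problems


if __name__ == "__main__":
    # A realtime plan that also asks for diarization is flagged.
    print(check_plan(TranscriptionPlan(realtime=True, diarize=True)))
```

Returning a list of problems rather than raising on the first one lets a caller report every conflict at once.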
Text to speech
Voxtral TTS (voxtral-mini-tts-2603) generates expressive speech and clones any voice from a sample as short as 2–3 seconds, capturing tone, rhythm, and personality. The model treats the voice prompt as an instruction, so it follows the speaker's intonation and emotional rendering without separate prosody tags. It supports 9 languages with cross-lingual cloning and code-mixing, and streams with low processing latency (~90ms) for voice-agent use.
- Voices: create and manage reusable voice profiles for consistent branding.
- Speech generation: generate speech from saved voices or one-off reference clips, with basic or streaming delivery.
Speech-to-speech pipeline
Combine the models into a real-time voice-to-voice loop: Voxtral Realtime transcribes incoming speech, an LLM reasons over the transcript and decides a response, and Voxtral TTS speaks the reply.
Each component is independently customizable and deployable. Cross-lingual voice adaptation lets the same pipeline handle live translation while preserving the speaker's accent and identity.
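The listen, reason, speak loop can be sketched as three interchangeable callables. This is an illustrative outline only: `voice_loop` and its `transcribe`, `reason`, and `speak` parameters are hypothetical names, and the stand-in functions in the usage example do not call any Voxtral model or Mistral API. In a real deployment each callable would wrap the corresponding service.

```python
from typing import Callable, Iterable, Iterator


def voice_loop(
    audio_turns: Iterable[bytes],
    transcribe: Callable[[bytes], str],   # stands in for realtime transcription
    reason: Callable[[str], str],         # stands in for the LLM
    speak: Callable[[str], bytes],        # stands in for text to speech
) -> Iterator[bytes]:
    """Yield one synthesized reply per non-empty user turn."""
    for audio in audio_turns:
        transcript = transcribe(audio)
        if not transcript.strip():
            continue  # nothing was said in this turn, so skip it
        reply_text = reason(transcript)
        yield speak(reply_text)


if __name__ == "__main__":
    # Stand-in components so the sketch runs without any network access.
    def fake_transcribe(audio: bytes) -> str:
        return audio.decode("utf-8")

    def fake_reason(text: str) -> str:
        return f"You said: {text}"

    def fake_speak(text: str) -> bytes:
        return text.encode("utf-8")

    for reply in voice_loop([b"hello", b"   "], fake_transcribe, fake_reason, fake_speak):
        print(reply)
```

Passing the three stages as parameters mirrors the point that each component is independently customizable: any one of them can be replaced without touching the loop.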
Models
The Voxtral family covers transcription, speech generation, and audio understanding, available as Premier and open-weights models. Open a model card for capabilities, languages, pricing, and benchmarks.
Voxtral Mini Transcribe 2
An efficient audio input model, pre-trained and optimized for transcription.
Voxtral TTS
State-of-the-art TTS with zero-shot voice cloning and multilingual support.
Voxtral Mini Transcribe Realtime
An efficient audio input model, pre-trained and optimized for live transcription.
Voxtral Small
Our first model with audio input capabilities for instruct use cases.
Test transcription in the speech-to-text playground, and voice generation and cloning in the text-to-speech playground.
How teams use audio
- Customer support: voice agents that route and resolve queries with natural, brand-appropriate speech.
- Financial services: compliant voice AI for advisory, policy queries, and client onboarding.
- Compliance and risk: real-time call monitoring with speaker attribution and auditable interaction records.
- Manufacturing and field operations: voice interfaces for inspection and feedback in high-noise environments.
- Meetings and sales: meeting intelligence with speaker attribution and automated follow-ups.
- Real-time translation: cross-lingual voice adaptation for live translation that preserves speaker identity and accent.