Audio

Transcribe speech to text, generate speech from text, clone voices, and build real-time voice agents with Mistral's Voxtral models.

Start with the outcome you want to build. Use the request-based audio APIs for files and bounded requests, the realtime API for live audio that needs low latency, and the full speech-to-speech pipeline for voice agents that listen, reason, and respond.

Choose your path

| Goal | Models | How to use |
|---|---|---|
| Transcribe recordings, meetings, or call archives | Voxtral Mini Transcribe 2 | Offline transcription |
| Transcribe live audio with sub-200ms latency | Voxtral Realtime | Realtime transcription |
| Generate natural speech or clone a voice | Voxtral TTS | Text to speech |
| Build a voice agent that listens, reasons, and speaks | Voxtral Realtime + an LLM + Voxtral TTS | Speech-to-speech pipeline |

Capabilities

Speech to text

Convert speech to text with high accuracy and low latency. Two models cover batch and live workloads:

  • Voxtral Mini Transcribe 2 for batch transcription: speaker diarization, context biasing (up to 100 custom terms), word-level timestamps, recordings of up to 3 hours per request, and noise-robust accuracy across 13 languages.
  • Voxtral Realtime (voxtral-mini-transcribe-realtime-2602) for live streaming: transcribes audio as it arrives, with latency configurable to under 200 ms, a 4B-parameter footprint for edge deployment, and open weights under Apache 2.0.

Note: Realtime transcription does not support the diarize parameter. For a given request, choose either realtime streaming or diarization with offline transcription.
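
The constraints above can be checked on the client before any request is sent. The following is a minimal sketch and not part of the Mistral SDK: `TranscriptionOptions`, `validate`, and all field names are hypothetical, and the offline model string is a placeholder because this page gives no API identifier for Voxtral Mini Transcribe 2. Only the limits themselves (100 context-bias terms, 3 hours per request, no diarization in realtime) come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits quoted on this page; everything else here is illustrative.
MAX_CONTEXT_TERMS = 100
MAX_OFFLINE_SECONDS = 3 * 60 * 60

REALTIME_MODEL = "voxtral-mini-transcribe-realtime-2602"  # identifier from this page
OFFLINE_MODEL = "voxtral-mini-transcribe-2"  # placeholder, not a confirmed API name


@dataclass
class TranscriptionOptions:
    """Hypothetical container for transcription settings (not an SDK type)."""
    realtime: bool = False
    diarize: bool = False
    context_terms: List[str] = field(default_factory=list)
    duration_seconds: Optional[float] = None  # length of the recording, if known


def validate(opts: TranscriptionOptions) -> str:
    """Return the model to use, or raise ValueError if the options conflict."""
    if opts.realtime and opts.diarize:
        raise ValueError("diarize is not available with realtime transcription")
    # The 100-term limit is stated for offline transcription; applying it to
    # every request is a conservative assumption.
    if len(opts.context_terms) > MAX_CONTEXT_TERMS:
        raise ValueError(f"context biasing accepts at most {MAX_CONTEXT_TERMS} terms")
    too_long = (
        not opts.realtime
        and opts.duration_seconds is not None
        and opts.duration_seconds > MAX_OFFLINE_SECONDS
    )
    if too_long:
        raise ValueError("offline requests are limited to 3 hours of audio")
    return REALTIME_MODEL if opts.realtime else OFFLINE_MODEL
```

Failing early like this keeps an invalid combination, such as realtime plus diarization, from reaching the API at all.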

Text to speech

Voxtral TTS (voxtral-mini-tts-2603) generates expressive speech and clones any voice from a sample as short as 2–3 seconds, capturing tone, rhythm, and personality. The model treats the voice prompt as an instruction, so it follows the speaker's intonation and emotional rendering without separate prosody tags. It supports 9 languages with cross-lingual cloning and code-mixing, and streams with low processing latency (~90ms) for voice-agent use.

  • Voices: create and manage reusable voice profiles for consistent branding.
  • Speech generation: generate speech from saved voices or one-off reference clips, with basic or streaming delivery.
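
Because cloning works from a short reference sample, it can help to check a clip's length locally before uploading it. The sketch below uses only Python's standard `wave` module and assumes the clip is a PCM WAV file; the formats the API accepts are not stated here. The helper names and the 2-second threshold are assumptions derived from the "2–3 seconds" figure above, not documented limits.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 2.0  # assumption based on the "2-3 seconds" figure above


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of a PCM WAV clip held in memory."""
    with wave.open(io.BytesIO(data), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def is_usable_reference(data: bytes, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(data) >= minimum


def silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

In practice `is_usable_reference` would be called on the bytes of a real recording; `silent_wav` exists only so the check can be tried without an audio file.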
Speech-to-speech pipeline

Combine the models into a real-time voice-to-voice loop: Voxtral Realtime transcribes incoming speech, an LLM reasons over the transcript and decides a response, and Voxtral TTS speaks the reply.

Speech in -> Voxtral Realtime (transcribe speech) -> LLM (reason over transcript) -> Voxtral TTS (speak the reply) -> speech out

Each component is independently customizable and deployable. Cross-lingual voice adaptation lets the same pipeline handle live translation while preserving the speaker's accent and identity.
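
The three-stage loop can be sketched with each stage as a plain callable, which mirrors the idea that components are independently swappable. This is a turn-based simplification with hypothetical names (`voice_loop` and the `fake_*` stand-ins); it makes no calls to Mistral services, and a production pipeline would stream partial transcripts and audio instead of handling whole turns.

```python
from typing import Callable, Iterable, Iterator

# Each stage is a plain callable so it can be replaced or deployed separately.
Transcribe = Callable[[bytes], str]  # audio turn -> transcript ("" if no speech)
Reason = Callable[[str], str]        # transcript -> reply text
Speak = Callable[[str], bytes]       # reply text -> synthesized audio


def voice_loop(
    audio_turns: Iterable[bytes],
    transcribe: Transcribe,
    reason: Reason,
    speak: Speak,
) -> Iterator[bytes]:
    """Yield one spoken reply for each user turn that contains speech."""
    for turn in audio_turns:
        transcript = transcribe(turn)
        if not transcript.strip():
            continue  # nothing was said in this turn, so stay silent
        reply = reason(transcript)
        yield speak(reply)


# Stand-in stages for a dry run; real ones would call the respective services.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reason(text: str) -> str:
    return f"You said: {text}"


def fake_speak(text: str) -> bytes:
    return text.encode("utf-8")


replies = list(
    voice_loop([b"hello", b"   ", b"bye"], fake_transcribe, fake_reason, fake_speak)
)
```

Keeping the stages behind simple function signatures means a different LLM or a translated reply step can be dropped in without touching the loop itself.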

Models

The Voxtral family covers transcription, speech generation, and audio understanding, available as Premier and open-weights models. Open a model card for capabilities, languages, pricing, and benchmarks.

How teams use audio

  • Customer support: voice agents that route and resolve queries with natural, brand-appropriate speech.
  • Financial services: compliant voice AI for advisory, policy queries, and client onboarding.
  • Compliance and risk: real-time call monitoring with speaker attribution and auditable interaction records.
  • Manufacturing and field operations: voice interfaces for inspection and feedback in high-noise environments.
  • Meetings and sales: meeting intelligence with speaker attribution and automated follow-ups.
  • Real-time translation: cross-lingual voice adaptation for live translation that preserves speaker identity and accent.
FAQ
