Speech Generation

Generate speech from text using a saved voice (voice_id) or a one-off reference audio clip (ref_audio).

Generate

Generate speech using the identifier (voice_id) of a previously saved voice.

import base64
from pathlib import Path
from mistralai.client import Mistral

client = Mistral(api_key="your-api-key")

response = client.audio.speech.complete(
    model="voxtral-mini-tts-2603",
    input="Hello! This is Voxtral, Mistral's text-to-speech model.",
    voice_id="your-voice-id",
    response_format="mp3",
)

Path("output.mp3").write_bytes(base64.b64decode(response.audio_data))
print("Saved to output.mp3")
Best Practices

Text Prompt Guidelines

  • Language match: the voice prompt should be in the same language as the text prompt for best results.
  • Cross-lingual prompts: the model also supports cross-lingual voice transfer. For example, a French voice prompt with English text will produce French-accented English.
  • Verbalizable form: convert numbers and symbols to their spoken equivalent to avoid ambiguity. For example, instead of 1234, write "one thousand two hundred thirty-four" or "twelve thirty-four", depending on context.
  • No rich formatting: avoid markdown, emojis, or special characters in the text. They will not be rendered and may degrade output quality.
  • Abbreviations: spell out abbreviations for better pronunciation. Use F-B-I or F.B.I. instead of FBI.
  • Length: keep prompts under 300 words for best results.
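
The guidelines above can be partly automated before text is sent to the model. The following is a minimal sketch, not part of the Mistral SDK: the helper names (strip_formatting, spell_out_abbreviations, chunk_text, prepare_tts_text) are hypothetical, the abbreviation rule is a rough heuristic, and number verbalization is left out because it depends on context.

```python
import re

MAX_WORDS = 300  # length guideline from the list above

# Rough emoji/pictograph ranges; an assumption, not an exhaustive list.
_EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF]")


def strip_formatting(text: str) -> str:
    """Remove common markdown markers and emojis, then collapse whitespace."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"[*_`#>~]", "", text)
    text = _EMOJI.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def spell_out_abbreviations(text: str) -> str:
    """Rewrite all-caps tokens of 2 to 5 letters as hyphenated letters.

    Heuristic only: acronyms that are read as words (for example NASA)
    would also be spelled out, so review the result for your content.
    """
    return re.sub(r"\b[A-Z]{2,5}\b", lambda m: "-".join(m.group(0)), text)


def chunk_text(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Group sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def prepare_tts_text(text: str) -> list[str]:
    """Apply the formatting, abbreviation, and length guidelines in order."""
    return chunk_text(spell_out_abbreviations(strip_formatting(text)))
```

Each string returned by prepare_tts_text could then be passed as the input of a separate speech request, with the resulting audio segments concatenated afterwards.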
Response Audio Formats

Format  Description
mp3     Compressed, suitable for most use cases
wav     Uncompressed PCM, highest quality
pcm     Raw float32 little-endian (LE) samples; recommended for streaming (lowest latency)
flac    Lossless compression
opus    Low bitrate, good for streaming
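
Because the pcm format carries no header, the raw samples need to be wrapped in a container before most players can open them. The sketch below shows one way to do that with the Python standard library. The function name float32le_to_wav is hypothetical, and the sample rate (24000 Hz) and mono channel layout are assumptions that are not stated on this page, so check them against the actual output before relying on the result.

```python
import struct
import wave


def float32le_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> int:
    """Write raw float32 little-endian samples to a 16-bit PCM WAV file.

    Assumes mono audio; sample_rate is an assumed default, not a documented value.
    Returns the number of samples written.
    """
    usable = len(pcm_bytes) - len(pcm_bytes) % 4  # ignore a trailing partial sample
    count = usable // 4
    samples = struct.unpack(f"<{count}f", pcm_bytes[:usable])
    # Clamp to [-1.0, 1.0] and scale to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # 16-bit output samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{count}h", *ints))
    return count
```

The conversion to 16-bit integers keeps the file readable by common players. It trades away some precision compared with the float32 source.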