Voxtral transcribes at the speed of sound.


Today we are launching Voxtral Transcribe 2, two next-generation speech-to-text models delivering industry-leading transcription quality, diarization, and ultra-low latency. The family includes Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Voxtral Realtime is open-weight under the Apache 2.0 license.

We’re also launching an audio playground in Mistral Studio to test transcription instantly, powered by Voxtral Transcribe 2, with diarization and timestamps.

Strengths.

  • Voxtral Mini Transcribe V2: State-of-the-art transcription with speaker diarization, contextual biasing, and word-level timestamps in 13 languages.

  • Voxtral Realtime: Specifically designed for live transcription with configurable latency down to less than 200ms, enabling real-time voice agents and applications.

  • Best-in-class efficiency: Industry-leading accuracy at a fraction of the cost, with Voxtral Mini Transcribe V2 achieving the lowest word error rate, at the lowest price point.

  • Open Weights: Voxtral Realtime ships under Apache 2.0, deployable at the edge for privacy-focused applications.

Voxtral in real time.

Voxtral Realtime is purpose-built for applications where latency matters. Unlike approaches that adapt offline models by processing audio in chunks, Realtime uses a new streaming architecture that transcribes audio as it arrives. The model delivers transcriptions with a configurable delay down to under 200 ms, opening up a new class of speech applications.

Word error rate (lower is better) across all languages in the FLEURS transcription benchmark.

With a delay of 2.4 seconds, ideal for closed captioning, Realtime matches Voxtral Mini Transcribe V2, our latest batch model. With a delay of 480 ms, the word error rate stays within 1-2%, giving voice agents near-offline accuracy.
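For readers less familiar with the metric, word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that calculation. It assumes plain whitespace tokenization; benchmark evaluations typically also normalize casing and punctuation first, which is omitted here, and the function name is illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the output is much longer than the reference.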

The model is natively multilingual, delivering strong transcription performance in 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B-parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.

We are releasing the model weights under Apache 2.0 on the Hugging Face Hub.

Voxtral Mini Transcribe V2.

Average diarization error rate (the lower the better) across five English benchmarks (Switchboard, CallHome, AMI-IHM, AMI-SDM, SBCSAE) and the multilingual TalkBank benchmark (German, Spanish, English, Chinese, Japanese).

Average word error rate (lower is better) across the top 10 languages in the FLEURS transcription benchmark.

Voxtral Mini Transcribe V2 delivers significant improvements in transcription and diarization quality across languages and domains. With a word error rate of around 4% on FLEURS at $0.003/min, Voxtral offers the best price-performance of any transcription API. It outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova in accuracy, and processes audio approximately 3x faster than ElevenLabs’ Scribe v2 while matching its quality at one-fifth the cost.

Model features.

Voxtral Mini Transcribe V2 introduces several key features.

Speaker diarization.

Generate transcripts with speaker tags and accurate start/end times. Ideal for meeting transcription, interview analysis, and multi-party call handling. Note: In case of overlapping speech, the model generally transcribes a single speaker.
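As an illustration of how speaker-tagged output might be consumed downstream, the sketch below merges consecutive segments from the same speaker into readable turns for meeting notes. The `Segment` shape (speaker label, start and end in seconds, text) is a hypothetical structure chosen for this example, not the documented response format of the API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical diarized segment; field names are assumptions."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the open turn instead of starting a new one.
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


if __name__ == "__main__":
    demo = [
        Segment("speaker_1", 0.0, 1.2, "Thanks for joining."),
        Segment("speaker_1", 1.3, 2.8, "Let's get started."),
        Segment("speaker_2", 3.0, 4.1, "Sounds good."),
    ]
    for turn in merge_turns(demo):
        print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

Since overlapping speech generally yields a single speaker, sorting by start time and merging neighbors is usually enough for this kind of post-processing.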

Contextual bias.

Provide up to 100 words or phrases to guide the model toward the correct spelling of names, technical terms, or domain-specific vocabulary. Particularly useful for proper nouns or industry terminology that standard models often miss. Contextual bias is optimized for English; support for other languages is experimental.
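A term list like this is often assembled from CRM exports or glossaries, so it can help to clean it before sending it. The sketch below is one possible client-side helper: it trims whitespace, drops empty entries and case-insensitive duplicates, and enforces the 100-term limit mentioned above. The helper name and the deduplication policy are assumptions made for illustration; how the terms are passed to the API is not shown here.

```python
MAX_TERMS = 100  # limit stated in the announcement


def prepare_bias_terms(terms: list[str]) -> list[str]:
    """Normalize a list of bias terms and check it against the limit."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse inner and outer whitespace
        key = normalized.casefold()          # assumption: duplicates ignore case
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    if len(cleaned) > MAX_TERMS:
        raise ValueError(
            f"{len(cleaned)} terms provided, but at most {MAX_TERMS} are allowed"
        )
    return cleaned


if __name__ == "__main__":
    print(prepare_bias_terms(["Voxtral", " voxtral ", "", "Le  Chat"]))
```

Raising an error rather than silently truncating keeps the choice of which terms to drop with the caller.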

Word-level timestamps.

Generate accurate start and end timestamps for each word, enabling applications such as caption generation, audio search, and content alignment.
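To make the caption use case concrete, here is a minimal sketch that groups word-level timestamps into SRT cues. The `Word` record (text, start, end in seconds) is a hypothetical shape for illustration rather than the API's actual response schema, and the grouping thresholds are arbitrary defaults.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level result; field names are assumptions."""
    text: str
    start: float  # seconds
    end: float    # seconds


def srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[Word], max_words: int = 7, max_gap: float = 0.8) -> str:
    """Start a new cue after max_words words or a pause longer than max_gap."""
    cues: list[list[Word]] = []
    current: list[Word] = []
    for word in words:
        pause = word.start - current[-1].end if current else 0.0
        if current and (len(current) >= max_words or pause > max_gap):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        timing = f"{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [Word("Hello", 0.0, 0.4), Word("world", 0.5, 0.9), Word("Again", 3.0, 3.4)]
    print(words_to_srt(demo))
```

Splitting on pauses as well as word count tends to keep cue boundaries aligned with natural phrase breaks.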

Extensive language support.

Like Realtime, this model now supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. Performance in languages other than English far exceeds that of competitors.

Robustness to noise.

Maintains transcription accuracy in harsh acoustic environments, such as factories, busy call centers and field recordings.

Longer audio support.

Process recordings of up to 3 hours in a single request.

Word error rate (lower is better) across all languages in the FLEURS transcription benchmark.

Audio playground.

Test Voxtral Transcribe 2 directly in Mistral Studio. Upload up to 10 audio files, enable diarization, choose timestamp granularity, and add contextual bias terms for domain-specific vocabulary. Supports .mp3, .wav, .m4a, .flac, .ogg files up to 1GB each.

Transform voice applications.

Voxtral powers voice workflows across diverse applications and industries.

  • Meeting intelligence.

    Transcribe multilingual recordings with speaker diarization that clearly attributes who said what and when. At Voxtral’s price point, annotate large volumes of meeting content with industry-leading cost efficiency.

  • Voice agents and virtual assistants.

    Build conversational AI with transcription latency below 200ms. Connect Voxtral Realtime to your LLM and TTS pipeline for responsive and natural voice interfaces.

  • Contact center automation.

    Transcribe calls in real time, allowing AI systems to analyze sentiment, suggest responses, and populate CRM fields while conversations are in progress. Speaker diarization ensures clear attribution between agents and customers.

  • Media and broadcasting.

    Generate live multilingual captions with minimal latency. Contextual bias handles proper names and technical terminology that cause generic transcription services to fail.

  • Compliance and documentation.

    Monitor and transcribe interactions to ensure regulatory compliance, with diarization providing clear speaker attribution and timestamps enabling accurate audit trails.

Both models support GDPR- and HIPAA-compliant deployments through secure on-premises or private cloud configurations.

Get started.

Voxtral Mini Transcribe V2 is now available via API at $0.003 per minute. Try it now in the new Mistral Studio audio playground or in Le Chat.

Voxtral Realtime is available via API at $0.006 per minute and as open weights on Hugging Face.
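For budgeting, the per-minute prices above translate into a simple estimate. The sketch below assumes billing is strictly linear in audio minutes with no rounding, minimums, or volume discounts, which the announcement does not specify. The tier labels `"batch"` and `"realtime"` are illustrative names, not API model identifiers.

```python
from decimal import Decimal

# Per-minute list prices quoted in this post, in US dollars.
PRICE_PER_MINUTE_USD = {
    "batch": Decimal("0.003"),     # Voxtral Mini Transcribe V2
    "realtime": Decimal("0.006"),  # Voxtral Realtime
}


def estimate_cost(audio_minutes: float, tier: str) -> Decimal:
    """Estimated cost in USD, assuming purely linear per-minute billing."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    # Convert through str so float inputs do not leak binary rounding noise.
    return Decimal(str(audio_minutes)) * PRICE_PER_MINUTE_USD[tier]


if __name__ == "__main__":
    # A single 3-hour recording, the longest supported in one request.
    print(estimate_cost(180, "batch"))
    # 1,000 hours of live calls.
    print(estimate_cost(1000 * 60, "realtime"))
```

`Decimal` is used instead of floats so that totals stay exact when many small per-minute amounts are summed.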

Explore documentation on Mistral’s audio and transcription capabilities.

We are recruiting.

If you’re excited about building world-class voice AI and putting cutting-edge models in the hands of developers around the world, we’d love to hear from you. Apply to join our team.
