Caroline Bishop
Feb 04, 2026 16:53
Mistral releases Voxtral Transcribe 2 with real-time streaming at $0.003/min, undercutting opponents whereas matching accuracy. Open weights beneath Apache 2.0.
Mistral AI dropped its second-generation speech-to-text fashions on February 4, 2026, with Voxtral Transcribe 2 delivering what the corporate claims is state-of-the-art transcription at a fraction of competitor pricing. The headline quantity: $0.003 per minute for batch processing, roughly one-fifth the price of ElevenLabs’ Scribe v2.
The discharge contains two fashions serving completely different use instances. Voxtral Mini Transcribe V2 handles batch jobs with speaker diarization and word-level timestamps. Voxtral Realtime targets stay purposes with latency configurable right down to sub-200 milliseconds—quick sufficient for voice brokers that do not really feel sluggish.
Efficiency Claims Stack Up Towards Huge Names
Mistral’s benchmarks present roughly 4% phrase error fee on FLEURS, outperforming GPT-4o mini Transcribe, Gemini 2.5 Flash, Meeting Common, and Deepgram Nova on accuracy. The corporate additionally claims 3x quicker processing than ElevenLabs’ Scribe v2 whereas matching high quality.
At 2.4 seconds delay—appropriate for stay subtitling—Realtime matches the batch mannequin’s accuracy. Drop to 480ms and also you’re taking a look at 1-2% further phrase error fee, which Mistral positions as acceptable for conversational AI purposes.
Open Weights Change the Deployment Math
Voxtral Realtime ships beneath Apache 2.0, that means enterprises can deploy on-premise with out API calls. With a 4B parameter footprint, the mannequin runs on edge gadgets—a major consideration for healthcare, finance, and different sectors the place audio information cannot depart inner infrastructure.
Each fashions help GDPR and HIPAA-compliant deployments, addressing the compliance complications which have slowed enterprise AI adoption.
What’s Really New Since July 2025
When Mistral launched the unique Voxtral in July 2025, speaker diarization was notably absent—the corporate was actively looking for design companions for the characteristic. This launch delivers on that promise, including exact speaker attribution with begin/finish instances. Context biasing is one other addition, letting customers feed as much as 100 domain-specific phrases to enhance accuracy on correct nouns and technical vocabulary.
Language help expanded to 13 languages together with Chinese language, Hindi, Arabic, Japanese, and Korean. Audio file limits jumped to three hours per request, up from the unique 30-40 minute caps.
Enterprise Implications
The pricing construction creates fascinating dynamics for contact facilities and assembly intelligence platforms at the moment paying premium charges for transcription APIs. At $0.003/min, transcribing one million minutes of audio runs $3,000—a quantity that makes beforehand cost-prohibitive use instances abruptly viable.
Voxtral Realtime’s $0.006/min pricing for streaming purposes nonetheless undercuts most opponents whereas enabling real-time sentiment evaluation and stay agent help options.
Builders can take a look at each fashions instantly by way of Mistral Studio’s new audio playground or seize Realtime weights immediately from Hugging Face.
Picture supply: Shutterstock
