CosyVoice 2.0 delivers 150ms streaming latency, a MOS of 5.53, and emotion and dialect control, including Cantonese and Sichuanese. Clone voices from samples as short as 10 seconds.
150ms first-chunk streaming. Suitable for real-time voice assistants and interactive applications.
Happy, sad, angry, gentle, and more. Adjust emotional tone without changing the cloned voice identity.
Mandarin, Cantonese, Sichuanese, and English. Each dialect maintains natural pronunciation and intonation.
Clone any voice from a 10–20 second clean audio sample. Captures timbre, rhythm, and speaking style.
Mean Opinion Score of 5.53—comparable to professional human recordings in blind listening tests.
CosyVoice 2.0 cuts pronunciation errors by 30–50% compared to 1.0, especially on rare words and numbers.
CosyVoice is one module within Alibaba's Tongyi Lab speech technology suite. It focuses on production-grade text-to-speech with an emphasis on Chinese-language quality, including the tonal precision and regional dialect variation that make Chinese TTS particularly challenging. The 2.0 release (December 2024) was a major quality leap, with 30–50% fewer pronunciation errors and first-chunk streaming latency reduced to 150ms.
CosyVoice pairs a language model with a flow-matching decoder: the language model converts text into semantic speech tokens, a flow-matching model turns those tokens into a mel spectrogram, and a vocoder renders the waveform. Voice cloning works by extracting a speaker embedding from the reference audio and conditioning generation on it. Token generation is autoregressive, but in 2.0 the flow-matching stage works on chunks of tokens rather than waiting for the whole utterance, which is what enables the low streaming latency.
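To make the streaming idea concrete, here is a minimal Python sketch of how a caller might measure first-chunk latency against total latency. It is not CosyVoice's API: `fake_synthesize_stream` is a hypothetical stand-in that yields placeholder chunks, and the chunk size and sleep duration are invented for illustration.

```python
import time
from typing import Iterator, Tuple


def fake_synthesize_stream(text: str, chunk_chars: int = 8,
                           delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine.

    Yields one placeholder "audio" chunk per `chunk_chars` characters of
    text, sleeping briefly to imitate per-chunk compute time.
    """
    for start in range(0, len(text), chunk_chars):
        time.sleep(delay_s)  # pretend to synthesize this chunk
        yield text[start:start + chunk_chars].encode("utf-8")


def measure_stream(stream: Iterator[bytes]) -> Tuple[float, float, int]:
    """Return (first_chunk_latency_s, total_latency_s, chunk_count)."""
    started = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            # Time until the caller could start playing audio.
            first = time.perf_counter() - started
    total = time.perf_counter() - started
    return (first if first is not None else total), total, count


text = "Hello from a streaming TTS sketch."
first, total, chunks = measure_stream(fake_synthesize_stream(text))
print(f"first chunk after {first:.3f}s, all {chunks} chunks after {total:.3f}s")
```

The point of the sketch is that first-chunk latency stays roughly constant while total latency grows with text length, which is why the two numbers are reported separately.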
One of CosyVoice's distinctive capabilities is combining dialect and emotion control. You can generate Cantonese speech with a happy tone, or Sichuanese with a gentle delivery. The model treats voice identity, dialect, and emotion as orthogonal dimensions.
This three-axis control lets content creators shape how their audio sounds without needing a separate model for each combination.
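One way to picture independent control axes is as a request object with one field per axis. The sketch below is illustrative only: `SpeechStyle`, its fields, and the instruction format are hypothetical and not part of any CosyVoice API, the dialect and emotion labels are simply the ones named on this page, and treating voice identity as the third axis is an assumption.

```python
from dataclasses import dataclass
from itertools import product

# Labels taken from this page; not an official list of accepted identifiers.
DIALECTS = ("Mandarin", "Cantonese", "Sichuanese", "English")
EMOTIONS = ("happy", "sad", "angry", "gentle")


@dataclass(frozen=True)
class SpeechStyle:
    """Hypothetical request object: one value per control axis."""
    voice_id: str  # which cloned voice to use (assumed third axis)
    dialect: str
    emotion: str

    def __post_init__(self) -> None:
        if self.dialect not in DIALECTS:
            raise ValueError(f"unknown dialect: {self.dialect}")
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion}")

    def instruction(self) -> str:
        """Render a plain-language style instruction (format is invented)."""
        return f"Speak in {self.dialect} with a {self.emotion} tone."


def all_styles(voice_id: str) -> list[SpeechStyle]:
    """Every dialect/emotion pairing for a single voice."""
    return [SpeechStyle(voice_id, d, e) for d, e in product(DIALECTS, EMOTIONS)]


style = SpeechStyle("narrator-01", "Cantonese", "happy")
print(style.instruction())
print(len(all_styles("narrator-01")), "combinations for one voice")
```

Because the axes are independent, four dialects and four emotions give sixteen combinations per voice from a single model, rather than sixteen separately trained variants.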
CosyVoice 2.0 achieves a reported Mean Opinion Score (MOS) of 5.53 on standard evaluation sets. For context, professional human recordings typically score 5.0–5.5 on the same scale. Note that these values sit above the conventional 1–5 MOS range, so they are only comparable to scores measured on the same scale. The pronunciation error rate was reduced by 30–50% compared to version 1.0, with particular improvements in number reading (e.g., phone numbers, dates) and rare character pronunciation, both common failure modes in Chinese TTS.
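The two metrics above are simple to compute, and a short worked example shows what a 30–50% relative reduction means in absolute terms. The baseline error rate and the listener ratings below are hypothetical numbers chosen for illustration; the page does not state the actual version 1.0 error rate.

```python
from statistics import mean


def mean_opinion_score(ratings: list[float]) -> float:
    """MOS is the arithmetic mean of individual listener ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(ratings)


def reduced_error_rate(baseline: float, relative_reduction: float) -> float:
    """Apply a relative reduction (0.30 means 30% fewer errors) to a rate."""
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline * (1.0 - relative_reduction)


# Hypothetical baseline: 4 mispronunciations per 100 test items.
baseline = 0.04
smallest_gain = reduced_error_rate(baseline, 0.30)  # about 2.8 per 100
largest_gain = reduced_error_rate(baseline, 0.50)   # about 2.0 per 100
print(f"{smallest_gain:.3f} to {largest_gain:.3f}")

# Hypothetical ratings from three listeners.
print(mean_opinion_score([4.0, 5.0, 3.0]))
```

A relative reduction scales with the baseline, so the same 30–50% figure means a much smaller absolute change if the earlier version's error rate was already low.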
Common deployment patterns include customer service IVR systems (where the low latency is critical), audiobook generation for Chinese-language publishers, voice navigation for in-car systems, and accessibility applications that generate Cantonese or Sichuanese audio for elderly users who may not be comfortable with standard Mandarin interfaces.
CosyVoice is Alibaba Tongyi Lab's TTS model. Version 2.0 delivers 150ms streaming, MOS 5.53, and emotion/dialect control.
Mandarin, Cantonese, Sichuanese, and English. Community fine-tuning adds more dialects.
150ms first-chunk latency. Full latency depends on text length and hardware.
10–20 seconds of clean reference audio for good cloning quality.
CosyVoice is open-source. Check the repository license for commercial-use terms.
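The 10–20 second guideline for reference audio can be checked before a clip is submitted for cloning. The sketch below uses only Python's standard `wave` module; the function names are hypothetical, the 16 kHz sample rate in the helper is an arbitrary choice for the demo, and the check covers length only, not whether the recording is clean.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a PCM WAV clip, computed as frames / sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(wav_bytes: bytes, min_s: float = 10.0,
                        max_s: float = 20.0) -> bool:
    """True when the clip length falls in the suggested 10-20 second window.

    This checks length only; it cannot tell whether the audio is clean.
    """
    return min_s <= wav_duration_seconds(wav_bytes) <= max_s


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


print(is_usable_reference(make_silent_wav(12)))  # clip inside the window
print(is_usable_reference(make_silent_wav(3)))   # clip too short
```

In practice the bytes would come from an uploaded file rather than `make_silent_wav`, and a real pipeline would also want a noise or clipping check, which is outside the scope of this sketch.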
CosyVoice3 provides guides, demos, and resources for Alibaba's Tongyi Lab voice synthesis technology. We focus on helping developers and content creators deploy high-quality Chinese and multi-dialect TTS in production.