Why Dialect and Emotion Matter

Most TTS models are trained on standard Mandarin or American English, leaving millions of dialect speakers without natural-sounding synthesis. CosyVoice3 is specifically designed to preserve tonal patterns, vocabulary choices, and rhythmic cadences unique to regional speech variants. Combined with fine-grained emotion control, it enables audio content that resonates with specific audiences rather than sounding generically neutral.

Benchmark Methodology

We tested 5 leading TTS systems on 200 sentences across 4 Chinese dialects (Cantonese, Sichuanese, Shanghainese, Hokkien) and 6 emotions (neutral, happy, sad, angry, surprised, fearful). Each output was rated by 20 native dialect speakers on a 1–5 Mean Opinion Score (MOS) scale. Tests were conducted in May 2025 using each model's latest available version.
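
As a rough illustration of how per-sample ratings turn into a MOS figure, the sketch below averages a set of 1–5 ratings and attaches a normal-approximation 95% confidence interval. The function name and the example ratings are hypothetical; the benchmark's actual aggregation script is not described here.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for a list of 1-5 ratings."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        # A single rating has no sample variance to estimate.
        return mos, float("nan")
    # Normal-approximation 95% interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings from 20 raters for one synthesized sentence.
example_ratings = [5, 4, 4, 5, 4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 5, 4, 4, 5]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

Reporting the interval alongside the mean helps readers judge whether a small gap between two models is likely to be meaningful with 20 raters.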

Overall Results

  • CosyVoice3 MOS: 4.32
  • Dialects tested: 4
  • Emotions tested: 6
  • Test sentences: 200

MOS Scores by Dialect

Dialect        CosyVoice3   Model B   Model C   Model D   Model E
Cantonese      4.41         3.82      3.65      4.12      3.28
Sichuanese     4.35         3.90      3.72      3.55      3.41
Shanghainese   4.18         3.45      3.30      3.88      2.95
Hokkien        4.28         3.22      3.10      3.70      2.80
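
To make the size of the gap concrete, the following sketch computes CosyVoice3's lead over the next-best model in each dialect. The scores are copied from the table above; the dictionary layout and the helper name are illustrative choices, not part of the benchmark tooling.

```python
# MOS scores by dialect, copied from the table above.
dialect_mos = {
    "Cantonese":    {"CosyVoice3": 4.41, "Model B": 3.82, "Model C": 3.65, "Model D": 4.12, "Model E": 3.28},
    "Sichuanese":   {"CosyVoice3": 4.35, "Model B": 3.90, "Model C": 3.72, "Model D": 3.55, "Model E": 3.41},
    "Shanghainese": {"CosyVoice3": 4.18, "Model B": 3.45, "Model C": 3.30, "Model D": 3.88, "Model E": 2.95},
    "Hokkien":      {"CosyVoice3": 4.28, "Model B": 3.22, "Model C": 3.10, "Model D": 3.70, "Model E": 2.80},
}

def lead_over_runner_up(scores, model="CosyVoice3"):
    """Return (runner_up_name, lead) for `model` against the best other model."""
    others = {name: s for name, s in scores.items() if name != model}
    runner_up = max(others, key=others.get)
    return runner_up, round(scores[model] - others[runner_up], 2)

leads = {dialect: lead_over_runner_up(scores) for dialect, scores in dialect_mos.items()}
for dialect, (runner_up, lead) in leads.items():
    print(f"{dialect}: +{lead:.2f} MOS over {runner_up}")
```

The runner-up is Model D in three dialects and Model B in Sichuanese, and the lead is widest for Hokkien.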

MOS Scores by Emotion

Emotion     CosyVoice3   Model B   Model C   Model D
Neutral     4.50         4.45      4.38      4.52
Happy       4.42         4.10      3.95      4.20
Sad         4.28         3.85      3.70      4.05
Angry       4.15         3.60      3.55      3.90
Surprised   4.05         3.40      3.25      3.75
Fearful     3.88         3.20      3.10      3.60
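
One way to read this table is to ask how far each model falls from its neutral baseline as emotions get harder. The sketch below computes that drop from the scores above; the variable and function names are illustrative only.

```python
# MOS scores by emotion, copied from the table above.
emotion_mos = {
    "CosyVoice3": {"Neutral": 4.50, "Happy": 4.42, "Sad": 4.28, "Angry": 4.15, "Surprised": 4.05, "Fearful": 3.88},
    "Model B":    {"Neutral": 4.45, "Happy": 4.10, "Sad": 3.85, "Angry": 3.60, "Surprised": 3.40, "Fearful": 3.20},
    "Model C":    {"Neutral": 4.38, "Happy": 3.95, "Sad": 3.70, "Angry": 3.55, "Surprised": 3.25, "Fearful": 3.10},
    "Model D":    {"Neutral": 4.52, "Happy": 4.20, "Sad": 4.05, "Angry": 3.90, "Surprised": 3.75, "Fearful": 3.60},
}

def largest_drop_from_neutral(scores):
    """Return (emotion, drop) for the emotion scoring furthest below neutral."""
    worst = min(scores, key=scores.get)
    return worst, round(scores["Neutral"] - scores[worst], 2)

drops = {model: largest_drop_from_neutral(scores) for model, scores in emotion_mos.items()}
for model, (emotion, drop) in drops.items():
    print(f"{model}: weakest on {emotion}, {drop:.2f} MOS below neutral")
```

Every model is weakest on fearful speech, but CosyVoice3 loses the least ground relative to its own neutral score.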

CosyVoice3 Strengths & Weaknesses

✅ Strengths

  • Best-in-class dialect preservation
  • Smooth emotion transitions within a single utterance
  • Low latency (real-time factor < 0.5)
  • Open weights under Apache 2.0 license
  • Voice cloning with only 5 seconds of reference
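
For readers unfamiliar with the latency metric in the list above: real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so a value below 1 means faster than real time. The sketch below shows one way to measure it. The `synthesize` callable and `fake_synthesize` are hypothetical stand-ins, not CosyVoice3's API, and the 16 kHz sample rate is an assumption.

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """Time one synthesis call and divide by the duration of its output.

    `synthesize` is any callable that takes text and returns a sequence of
    audio samples. It is a placeholder for a real model call.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    if len(samples) == 0:
        raise ValueError("synthesis produced no audio")
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds

def fake_synthesize(text):
    # Stand-in that returns one second of silence at 16 kHz.
    return [0.0] * 16000

rtf = real_time_factor(fake_synthesize, "hello", sample_rate=16000)
print(f"RTF = {rtf:.4f}")
```

In practice RTF is usually averaged over many utterances after a warm-up run, since the first call often includes model loading.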

❌ Weaknesses

  • Neutral emotion scores slightly below Model D
  • English mixed-language quality is average
  • Fearful emotion can sound exaggerated
  • Requires CUDA GPU (no CPU inference yet)
  • Documentation is partly in Chinese only

Practical Recommendations

  • For dialect audiobooks: CosyVoice3 is the clear winner. It scored highest in all four dialects tested, leading the next-best model by 0.29 to 0.58 MOS.
  • For standard Mandarin narration: Model D is on par for neutral speech (4.52 vs. 4.50 MOS) and has lower VRAM requirements.
  • For emotional storytelling: CosyVoice3 excels, especially for happy and sad emotions. Use emotion tags in your script.
  • For English content: Consider specialized English-first models. CosyVoice3's English quality lags behind its Chinese capabilities.

How We Ensure Fair Testing

All models were tested on identical hardware (NVIDIA A100 80GB), using their default settings unless specified otherwise. Reference audio for voice cloning tests used the same 15-second clip across all models. Human raters were recruited from native dialect speaker communities and were not informed which model produced which sample (double-blind evaluation).
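
A minimal sketch of how samples might be anonymized before they reach raters: shuffle the clips, give each an opaque ID, and keep the ID-to-model key away from the rating sheet. The function name, ID format, and file paths are hypothetical; the benchmark's actual tooling is not described here.

```python
import random

def blind_samples(samples, seed=None):
    """Shuffle (model_name, audio_path) pairs and assign opaque IDs.

    Returns (rating_sheet, answer_key). The rating sheet holds only the
    opaque IDs shown to raters; the answer key maps each ID back to its
    model and source file and stays with the organizers.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    rating_sheet = []
    answer_key = {}
    for index, (model, path) in enumerate(shuffled, start=1):
        sample_id = f"sample_{index:03d}"
        rating_sheet.append(sample_id)
        answer_key[sample_id] = (model, path)
    return rating_sheet, answer_key

# Hypothetical clips; the paths are placeholders.
clips = [
    ("CosyVoice3", "out/cosyvoice3/cantonese_001.wav"),
    ("Model B", "out/model_b/cantonese_001.wav"),
    ("Model D", "out/model_d/cantonese_001.wav"),
]
sheet, key = blind_samples(clips, seed=42)
print(sheet)
```

The audio files served to raters should be copied or renamed to the opaque ID as well, since an original path like the ones above would reveal the model.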