Dialect & Emotion TTS Comparison

Why Dialect and Emotion Matter

Most TTS models are trained mainly on standard Mandarin or American English, leaving millions of dialect speakers without natural-sounding synthesis. CosyVoice3 is designed to preserve the tonal patterns, vocabulary choices, and rhythmic cadences unique to regional speech variants. Combined with fine-grained emotion control, this enables audio content that resonates with specific audiences rather than sounding generically neutral.

Benchmark Methodology

We tested 5 leading TTS systems on 200 sentences across 4 Chinese dialects (Cantonese, Sichuanese, Shanghainese, Hokkien) and 6 emotions (neutral, happy, sad, angry, surprised, fearful). Each output was rated by 20 native dialect speakers on a 1–5 Mean Opinion Score (MOS) scale. Tests were conducted in May 2025 using each model's latest available version.
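As a rough illustration of how a MOS figure is derived from individual ratings, the sketch below averages one set of 1–5 scores and attaches a 95% confidence interval. The ratings are made-up placeholder values, not data from this benchmark, and the normal-approximation interval is an assumption; the article does not say how (or whether) intervals were computed.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 ratings."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        return mos, 0.0
    # Normal approximation: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings from 20 raters for one synthesized sentence.
example_ratings = [5, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

With only 20 raters per sample, a t-distribution multiplier would give a slightly wider interval than the 1.96 used here.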

Overall Results

CosyVoice3 MOS: 4.32

Dialects tested: 4

Emotions tested: 6

Test sentences: 200

MOS Scores by Dialect

Dialect	CosyVoice3	Model B	Model C	Model D	Model E
Cantonese	4.41	3.82	3.65	4.12	3.28
Sichuanese	4.35	3.90	3.72	3.55	3.41
Shanghainese	4.18	3.45	3.30	3.88	2.95
Hokkien	4.28	3.22	3.10	3.70	2.80
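One way to read the dialect table is to ask how far CosyVoice3 leads the next-best model in each dialect. The sketch below does that arithmetic on the values copied from the table; the function and variable names are illustrative only.

```python
# MOS values copied from the dialect table above.
MODELS = ["CosyVoice3", "Model B", "Model C", "Model D", "Model E"]
DIALECT_MOS = {
    "Cantonese":    [4.41, 3.82, 3.65, 4.12, 3.28],
    "Sichuanese":   [4.35, 3.90, 3.72, 3.55, 3.41],
    "Shanghainese": [4.18, 3.45, 3.30, 3.88, 2.95],
    "Hokkien":      [4.28, 3.22, 3.10, 3.70, 2.80],
}

def lead_over_runner_up(scores, target="CosyVoice3"):
    """Return (runner-up model, MOS margin) for the target model."""
    by_model = dict(zip(MODELS, scores))
    target_score = by_model.pop(target)
    runner_up = max(by_model, key=by_model.get)
    return runner_up, round(target_score - by_model[runner_up], 2)

margins = {d: lead_over_runner_up(s) for d, s in DIALECT_MOS.items()}
for dialect, (rival, margin) in margins.items():
    print(f"{dialect}: +{margin:.2f} over {rival}")
```

The runner-up differs by dialect, which is why the margin is computed per row rather than against a single competitor.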

MOS Scores by Emotion

Emotion	CosyVoice3	Model B	Model C	Model D
Neutral	4.50	4.45	4.38	4.52
Happy	4.42	4.10	3.95	4.20
Sad	4.28	3.85	3.70	4.05
Angry	4.15	3.60	3.55	3.90
Surprised	4.05	3.40	3.25	3.75
Fearful	3.88	3.20	3.10	3.60
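The emotion table also shows how much each model degrades as it moves away from neutral speech. A minimal sketch of that comparison, using the values from the table (names are illustrative):

```python
# MOS values copied from the emotion table above.
EMOTION_MODELS = ["CosyVoice3", "Model B", "Model C", "Model D"]
EMOTION_MOS = {
    "Neutral":   [4.50, 4.45, 4.38, 4.52],
    "Happy":     [4.42, 4.10, 3.95, 4.20],
    "Sad":       [4.28, 3.85, 3.70, 4.05],
    "Angry":     [4.15, 3.60, 3.55, 3.90],
    "Surprised": [4.05, 3.40, 3.25, 3.75],
    "Fearful":   [3.88, 3.20, 3.10, 3.60],
}

def drop_from_neutral(model):
    """Return (weakest emotion, MOS drop relative to neutral) for a model."""
    idx = EMOTION_MODELS.index(model)
    neutral = EMOTION_MOS["Neutral"][idx]
    weakest = min(EMOTION_MOS, key=lambda e: EMOTION_MOS[e][idx])
    return weakest, round(neutral - EMOTION_MOS[weakest][idx], 2)

drops = {m: drop_from_neutral(m) for m in EMOTION_MODELS}
for model, (emotion, drop) in drops.items():
    print(f"{model}: weakest on {emotion}, {drop:.2f} below neutral")
```

On these numbers every model is weakest on fearful speech, and CosyVoice3 shows the smallest drop from its neutral score.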

CosyVoice3 Strengths & Weaknesses

✅ Strengths

Best-in-class dialect preservation
Smooth emotion transitions within a single utterance
Low latency (real-time factor < 0.5)
Open weights under Apache 2.0 license
Voice cloning with only 5 seconds of reference
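
The real-time factor quoted above is the ratio of processing time to the duration of the audio produced, so a value below 1 means synthesis runs faster than playback. The sketch below shows one way to measure it; `fake_synthesize` is a hypothetical stand-in, not the CosyVoice3 API, and a real measurement would replace it with an actual model call.

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """Time one synthesis call and return processing_time / audio_duration."""
    start = time.perf_counter()
    samples = synthesize(text)  # expected to return a sequence of audio samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesizer returned no audio")
    return elapsed / audio_seconds

def fake_synthesize(text):
    """Hypothetical stand-in for a TTS call: 2 s of silence at 16 kHz."""
    time.sleep(0.05)  # pretend the model takes 50 ms
    return [0.0] * 32000

rtf = real_time_factor(fake_synthesize, "你好", sample_rate=16000)
print(f"RTF = {rtf:.3f}")
```

When timing a GPU model, the call should block until the audio is fully generated; otherwise the measured time can understate the real cost.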

❌ Weaknesses

Neutral-emotion MOS slightly below Model D (4.50 vs. 4.52)
Quality on mixed Chinese-English text is only average
Fearful emotion can sound exaggerated
Requires CUDA GPU (no CPU inference yet)
Documentation is partly in Chinese only

Practical Recommendations

For dialect audiobooks: CosyVoice3 is the clear winner. It scored highest in all four tested dialects, with MOS between 4.18 and 4.41.
For standard Mandarin narration: Model D performs comparably on neutral speech (4.52 vs. 4.50 MOS) and has lower VRAM requirements.
For emotional storytelling: CosyVoice3 excels, especially for happy and sad emotions. Use emotion tags in your script.
For English content: Consider specialized English-first models. CosyVoice3's English quality lags behind its Chinese capabilities.

How We Ensure Fair Testing

All models were tested on identical hardware (NVIDIA A100 80GB), using their default settings unless specified otherwise. Reference audio for voice cloning tests used the same 15-second clip across all models. Human raters were recruited from native dialect speaker communities and were not informed which model produced which sample (double-blind evaluation).
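
To illustrate the blinding step, the sketch below shuffles samples and replaces model names with opaque IDs, keeping the ID-to-model key separate from what raters see. This is an assumed procedure for illustration, not the tooling used in the benchmark; the file paths and the `build_blind_playlist` helper are hypothetical.

```python
import random

def build_blind_playlist(samples, seed=0):
    """Shuffle samples and hide model names behind opaque IDs.

    samples: list of (model_name, audio_path) tuples.
    Returns (playlist, key). Raters see only the playlist; the key,
    which maps IDs back to models, stays with the organizer.
    """
    rng = random.Random(seed)  # fixed seed makes the order reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    playlist, key = [], {}
    for i, (model, path) in enumerate(shuffled, start=1):
        sample_id = f"S{i:03d}"
        playlist.append({"id": sample_id, "audio": path})
        key[sample_id] = model
    return playlist, key

# Hypothetical inputs: file names must not reveal the model either.
samples = [
    ("CosyVoice3", "clips/0001.wav"),
    ("Model B", "clips/0002.wav"),
    ("Model C", "clips/0003.wav"),
    ("Model D", "clips/0004.wav"),
    ("Model E", "clips/0005.wav"),
]
playlist, key = build_blind_playlist(samples, seed=42)
```

Scores are collected against the sample IDs and joined back to model names through the key only after all ratings are in.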