🤖 AI Summary
This work addresses the lack of a standardized, reproducible offline evaluation protocol for speech-to-speech translation (S2ST): studies report non-overlapping metric subsets, which prevents direct comparison. The authors introduce COMPASS, a unified S2ST benchmarking framework that integrates 46 automatic metrics (including TER, ChrF++, UTMOS, and NISQA-MOS) across eight evaluation dimensions. They evaluate 1,248 model-language configurations on FLEURS and CVSS, covering cascaded and end-to-end architectures over ten language pairs. Correlation filtering reduces the 46 metrics to 10 per translation direction; these subsets preserve system rankings (Spearman's ρ > 0.80) while cutting evaluation time by approximately 2.5×. Because architectures differ by more than 30% on naturalness and speaker preservation but by only a few points on translation quality, rankings based on a single metric can be misleading. In human validation across dubbing, podcast, and medical domains, standalone MOS predictors fail to predict listener preference, whereas the top domain-specific metrics correlate strongly with human judgment (ρ ≥ 0.90). COMPASS thus provides an efficient, reliable, and domain-aware foundation for S2ST evaluation.
📝 Abstract
Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X$\to$EN and EN$\to$X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's $ρ>0.80$) while cutting evaluation time by $\approx 2.5\times$. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment ($ρ\geq 0.90$). We release COMPASS as a foundation for domain-aware S2ST evaluation.
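
The abstract does not spell out how the correlation filtering is performed, so the following is only a minimal sketch of one plausible approach: a greedy pass that drops any metric whose Spearman rank correlation with an already-kept metric is too high. The function names, the 0.9 redundancy threshold, and the toy scores are all assumptions made for illustration and are not taken from COMPASS.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def filter_metrics(scores, threshold=0.9):
    """Greedily keep a metric only if it is not redundant with a kept one.

    `scores` maps a metric name to its per-system scores. The threshold
    is a hypothetical choice, not a value reported by the paper.
    """
    kept = []
    for name, values in scores.items():
        if all(abs(spearman(values, scores[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy scores for five hypothetical systems (invented for illustration).
toy_scores = {
    "metric_a": [1, 2, 3, 4, 5],
    "metric_b": [2, 4, 6, 8, 11],  # same ordering as metric_a, so redundant
    "metric_c": [5, 1, 4, 2, 3],   # different ordering, so it is kept
}
selected = filter_metrics(toy_scores)
```

In this toy case `metric_b` ranks the systems exactly as `metric_a` does and is dropped, while `metric_c` survives. The same `spearman` helper could then compare system rankings from the reduced subset against rankings from the full metric set, which is the kind of check the reported $ρ>0.80$ figure describes.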