🤖 AI Summary
Evaluating cross-lingual translation quality typically requires executing large language models—a computationally expensive and time-consuming process.
Method: We propose an execution-free paradigm that predicts translation quality by training lightweight gradient-boosted regression models on readily available metadata features, including token fertility ratio, token count, and typological attributes (genealogical family, writing system, geographic region), with GPT-4o’s ChrF scores across 203 languages on the FLORES-200 benchmark as the prediction target.
Contribution/Results: Feature importance analysis reveals synergistic effects between linguistic typology and token fertility in quality estimation. Our model achieves R² = 0.66 for XX→English and R² = 0.72 for English→XX translation directions, substantially outperforming existing baselines. This approach enables low-cost, scalable assessment and language-specific adaptation of translation systems without inference overhead.
📝 Abstract
We show that translation quality can be predicted with surprising accuracy \textit{without ever running the translation system itself}. Using only a handful of features, token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance ($R^{2}=0.66$ for XX$\rightarrow$English and $R^{2}=0.72$ for English$\rightarrow$XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.
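
To make the setup concrete, the following is a minimal, dependency-free sketch of gradient boosting for regression with decision stumps, fitted to a per-language feature vector and scored with $R^{2}$. It is an illustration under stated assumptions, not the authors' implementation. The feature rows and ChrF values are synthetic, the one-hot `is_latin_script` column stands in for the richer family, script, and region metadata, and the hyperparameters (`n_rounds`, `learning_rate`) are arbitrary choices.

```python
from statistics import mean


def fit_stump(X, residuals):
    """Return the one-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = mean(left), mean(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:  # every feature is constant: fall back to a constant stump
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]  # (feature index, threshold, left value, right value)


def predict_stump(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def fit_gbm(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict_gbm(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Synthetic rows (NOT from the paper): [token fertility, token count, is_latin_script]
X = [
    [1.1, 30, 1], [1.3, 34, 1], [1.6, 41, 1], [2.0, 52, 0],
    [2.4, 60, 0], [3.1, 78, 0], [3.8, 95, 0], [1.2, 32, 1],
]
y = [62.0, 58.0, 51.0, 44.0, 39.0, 30.0, 22.0, 60.0]  # made-up ChrF scores

model = fit_gbm(X, y)
train_preds = [predict_gbm(model, row) for row in X]
train_r2 = r2_score(y, train_preds)
print(f"training R^2 on the toy data: {train_r2:.3f}")
```

The score printed here is a training-set fit on eight invented points, so it says nothing about the reported results. A real evaluation would score languages held out from fitting, and would likely use an established gradient boosting library rather than hand-rolled stumps.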