A Critical Study of Automatic Evaluation in Sign Language Translation

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically examines how reliable existing automatic evaluation metrics are for text-based sign language translation (SLT). It analyzes conventional metrics (BLEU, ROUGE, chrF, and BLEURT) alongside large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment, running controlled experiments along three dimensions: semantic consistency under paraphrasing, sensitivity to hallucinations, and robustness to sentence-length variation. The results show that lexical-overlap metrics fail to reward meaning-preserving paraphrases, and that while LLM-based evaluators capture semantic equivalence more faithfully, they exhibit bias toward LLM-paraphrased translations. All metrics detect hallucinations, but BLEU is overly sensitive, whereas BLEURT and the LLM-based evaluators are lenient toward subtle cases. On this basis, the study argues for multimodal evaluation frameworks, attuned to SLT's linguistic and visual characteristics, that move beyond unimodal text-centric benchmarks.
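The gap between lexical overlap and semantic equivalence can be illustrated with a small probe. The snippet below is a minimal sketch using the sacrebleu library on invented example sentences; it is not the authors' experimental code, but it shows why a meaning-preserving paraphrase scores poorly under BLEU and chrF.

```python
import sacrebleu

# One reference translation and two hypotheses: an exact match and a
# meaning-preserving paraphrase (invented examples for illustration).
reference = ["the weather will be sunny tomorrow afternoon"]

candidates = {
    "exact match": "the weather will be sunny tomorrow afternoon",
    "paraphrase":  "tomorrow afternoon it is going to be sunny",
}

for name, hyp in candidates.items():
    bleu = sacrebleu.sentence_bleu(hyp, reference).score   # word n-gram overlap
    chrf = sacrebleu.sentence_chrf(hyp, reference).score   # character n-gram F-score
    print(f"{name:12s} BLEU={bleu:5.1f}  chrF={chrf:5.1f}")
```

Both metrics score the paraphrase far below the exact match even though the meaning is preserved, which is the semantic-consistency weakness the study probes.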

📝 Abstract
Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
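For the LLM-based side, GEMBA-style zero-shot direct assessment amounts to prompting a model to rate a candidate translation against the reference on a 0-100 scale. The sketch below paraphrases that setup; the prompt wording, model name, and OpenAI client usage are illustrative assumptions, not the paper's exact configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gemba_style_da(source: str, hypothesis: str, reference: str) -> str:
    """Ask an LLM for a 0-100 direct assessment score (GEMBA-style, paraphrased prompt)."""
    prompt = (
        "Score the following translation with respect to the human reference on a "
        "continuous scale from 0 to 100, where 0 means no meaning is preserved and "
        "100 means perfect meaning and grammar.\n\n"
        f"Source: {source}\n"
        f"Reference: {reference}\n"
        f"Translation: {hypothesis}\n"
        "Score:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the evaluator model used in the paper may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```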
Problem

Research questions and friction points this paper is trying to address.

Investigating limitations of text-based metrics for sign language translation evaluation
Assessing metric consistency under paraphrasing, hallucinations, and length variations
Demonstrating the need for multimodal evaluation frameworks beyond current text-based approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed limitations of text-based SLT evaluation metrics
Assessed metrics under paraphrasing, hallucinations, and length variations (see the sketch after this list)
Motivated multimodal evaluation frameworks beyond text for holistic assessment
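As referenced above, a simple harness for the three controlled conditions can be sketched as follows: score a clean hypothesis and its perturbed variants (paraphrase, injected hallucination, truncation) against the same reference and compare the per-metric deltas. The sentences and metric choices are illustrative assumptions, not the paper's data or released code.

```python
import sacrebleu

reference = ["she signed that the train leaves at nine in the morning"]

# Perturbed variants of a clean hypothesis, one per controlled condition.
conditions = {
    "clean":         "she signed that the train leaves at nine in the morning",
    "paraphrase":    "she indicated the train departs at nine a.m.",
    "hallucination": "she signed that the train leaves at nine and the airport is closed",
    "shortened":     "the train leaves at nine",
}

def scores(hyp: str) -> dict:
    return {
        "BLEU": sacrebleu.sentence_bleu(hyp, reference).score,
        "chrF": sacrebleu.sentence_chrf(hyp, reference).score,
    }

baseline = scores(conditions["clean"])
for name, hyp in conditions.items():
    delta = {m: round(v - baseline[m], 1) for m, v in scores(hyp).items()}
    print(f"{name:13s} {delta}")
```

A robust metric should penalize the injected hallucination more than the harmless paraphrase and degrade gracefully, not abruptly, as length changes; the study's consistency and robustness analysis checks exactly these behaviors.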