🤖 AI Summary
This study empirically evaluates the practical utility of word-level quality estimation (QE) in professional machine translation post-editing (PE), addressing a gap in research on its downstream usability. In a realistic post-editing task spanning two translation directions, 42 professional translators worked under four error-span highlighting modalities, with behavioral logging combined with fine-grained (word- and segment-level) human annotation to systematically assess QE's impact on editing speed, output quality, and revision behavior in an authentic professional setting. Methodologically, the study compares supervised QE models with unsupervised, uncertainty-based estimation, identifying domain, language direction, and editor speed as key moderating factors. Results show only modest differences between human-made and automated QE highlights: overall editing efficiency remains largely unchanged, but highlights guide more focused and targeted revisions. Crucially, QE accuracy does not directly translate into practical efficacy, underscoring the need for human-centered design and contextual adaptation in real-world deployment.
📝 Abstract
Word-level quality estimation (QE) detects erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
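To make the "uncertainty-based word-level QE" modality concrete, the following is a minimal, hypothetical sketch of how potential error spans could be flagged from the MT model's own token-level confidence. The function name, thresholding heuristic, and example values are illustrative assumptions, not the actual method or data used in QE4PE.

```python
import math
from typing import List, Tuple


def highlight_uncertain_spans(
    tokens: List[str],
    token_logprobs: List[float],
    threshold: float = math.log(0.5),
) -> List[Tuple[str, bool]]:
    """Flag tokens whose generation log-probability falls below a threshold.

    Sketch of an unsupervised, uncertainty-based word-level QE heuristic:
    tokens the MT model was least confident about are marked as potential
    error spans to be highlighted for the post-editor. The study's actual
    setup may differ (e.g. calibrated thresholds per language direction,
    span merging, or supervised QE models instead of raw confidences).
    """
    return [(tok, lp < threshold) for tok, lp in zip(tokens, token_logprobs)]


if __name__ == "__main__":
    # Hypothetical MT output with per-token log-probabilities.
    tokens = ["Das", "Haus", "ist", "grün", "."]
    logprobs = [-0.1, -0.3, -1.9, -0.2, -0.05]
    for tok, flagged in highlight_uncertain_spans(tokens, logprobs):
        print(f"[{tok}]" if flagged else tok, end=" ")
    # Prints: Das Haus [ist] grün .
```

In a post-editing interface, the flagged spans would be rendered as highlights over the MT output; supervised QE methods would instead predict such spans with a trained model rather than from decoder confidence.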