🤖 AI Summary
This work addresses coreference resolution in low-resource languages, where progress is hindered by the scarcity of high-quality annotated data. The authors propose a machine translation-based data augmentation approach that generates or expands training data by translating annotated English samples into the target language. To check translation quality automatically, they introduce a cycle-consistency mechanism: each translated sample is back-translated into English and compared with the original via cosine similarity in the latent space of a BERT model. The resulting similarity scores are incorporated into the training loss as per-sample weights. Experiments on four low-resource languages show significant performance gains, and the pipeline enables accurate coreference resolution in languages for which no corpora were previously available.
📝 Abstract
Coreference resolution is a core NLP task with a broad range of downstream applications, e.g. machine translation, question answering, and document summarization. While the task is well-studied in English, comparatively less attention has been dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate them and assess their similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
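The weighting step described above can be illustrated with a minimal sketch. It assumes the BERT embeddings of the original and back-translated English samples have already been computed and are passed in as plain lists of floats. The function names, the clipping of negative similarities to zero, and the normalization by the sum of weights are illustrative assumptions; the abstract does not specify how the scores enter the loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cycle_consistency_weights(original_embs, back_translated_embs):
    """One weight per sample: similarity between the original English embedding
    and the embedding of its back-translation.

    Assumption: negative similarities are clipped to 0 so that weights stay
    non-negative. This choice is not taken from the abstract.
    """
    return [
        max(0.0, cosine_similarity(orig, back))
        for orig, back in zip(original_embs, back_translated_embs)
    ]


def weighted_loss(per_sample_losses, weights):
    """Weighted mean of per-sample losses.

    Assumption: the loss is normalized by the sum of weights. The paper may
    combine scores and losses differently.
    """
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total_weight


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings": the first sample survives the translation
    # cycle intact, the second comes back orthogonal to the original.
    originals = [[1.0, 0.0], [1.0, 0.0]]
    back_translations = [[1.0, 0.0], [0.0, 1.0]]
    weights = cycle_consistency_weights(originals, back_translations)
    print(weights)
    print(weighted_loss([2.0, 10.0], weights))
```

In this sketch, a sample whose back-translation drifts far from the original contributes little or nothing to the loss, so noisy translations have less influence on training than faithful ones.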