🤖 AI Summary
Low-resource languages suffer from a critical scarcity of high-quality Word Sense Disambiguation (WSD) evaluation data, which severely hinders cross-lingual transfer research, especially for the Word-in-Context (WiC) task. To address this, we introduce the first fine-grained, multilingual WiC benchmark covering nine low-resource languages across diverse language families and writing systems. We propose a semi-automatic hybrid annotation framework that integrates human verification, context-aware prompting, and multilingual pretrained language models, enabling scalable, high-accuracy sense annotation for low-resource languages for the first time. The dataset strictly adheres to the standard WiC format and is publicly released. Experimental results demonstrate substantial improvements in WSD performance on low-resource languages, establishing foundational infrastructure for fair, robust, and truly multilingual NLP and providing empirical support for cross-lingual semantic understanding.
📝 Abstract
This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer is a key strategy for leveraging multilingual pretraining to extend language technologies to understudied and typologically diverse languages, its effectiveness depends on suitable, high-quality benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC)-formatted experiments that evaluate transfer to these low-resource languages. The results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
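For readers unfamiliar with the WiC format referenced above, the sketch below shows how such an instance is typically structured and scored: each example pairs a polysemous target word with two contexts and a binary label (same sense or not), and systems are evaluated by classification accuracy. The field names and example sentences here are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class WiCInstance:
    # Hypothetical field names for illustration only.
    lemma: str       # the polysemous target word
    context1: str    # first sentence containing the target
    context2: str    # second sentence containing the target
    label: bool      # True if the target has the same sense in both contexts

def accuracy(gold, pred):
    """Standard WiC metric: binary classification accuracy."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

examples = [
    WiCInstance("bank",
                "She sat on the bank of the river.",
                "He deposited the cash at the bank.",
                False),
    WiCInstance("play",
                "The children play in the yard every day.",
                "They play outside after school.",
                True),
]

gold = [ex.label for ex in examples]
print(accuracy(gold, [False, True]))  # prints 1.0
```

Because the label is binary and the class balance is typically controlled, accuracy is the conventional headline metric for WiC-style evaluations.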