🤖 AI Summary
Low-resource languages suffer from a critical scarcity of high-quality Word Sense Disambiguation (WSD) evaluation data, which severely hinders cross-lingual transfer research, especially for the Word-in-Context (WiC) task. To address this, we introduce the first fine-grained, multilingual WiC benchmark covering nine low-resource languages across diverse language families and writing systems. We propose a semi-automatic hybrid annotation framework that integrates human verification, context-aware prompting, and multilingual pretrained language models, enabling scalable, high-accuracy sense annotation for low-resource languages for the first time. The dataset strictly adheres to the standard WiC format and is publicly released. Experimental results demonstrate substantial improvements in WSD performance on low-resource languages, establishing foundational infrastructure for fair, robust, and truly multilingual NLP and providing empirical support for cross-lingual semantic understanding.
📝 Abstract
This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer is a key strategy for leveraging multilingual pretraining to extend language technologies to understudied and typologically diverse languages, its effectiveness depends on suitable, high-quality benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC)-formatted experiments that evaluate transfer to these low-resource languages. The results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
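For readers unfamiliar with the WiC format referenced above, the sketch below shows how such an instance is typically structured and scored: each example pairs a polysemous target word with two contexts and a binary label (same sense or not), and systems are evaluated by classification accuracy. The field names and example sentences here are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class WiCInstance:
    # Hypothetical field names for illustration only.
    lemma: str       # the polysemous target word
    context1: str    # first sentence containing the target
    context2: str    # second sentence containing the target
    label: bool      # True if the target has the same sense in both contexts

def accuracy(gold, pred):
    """Standard WiC metric: binary classification accuracy."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

examples = [
    WiCInstance("bank",
                "She sat on the bank of the river.",
                "He deposited the cash at the bank.",
                False),
    WiCInstance("play",
                "The children play in the yard every day.",
                "They play outside after school.",
                True),
]

gold = [ex.label for ex in examples]
print(accuracy(gold, [False, True]))  # prints 1.0
```

Because the label is binary and the class balance is typically controlled, accuracy is the conventional headline metric for WiC-style evaluations.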