SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

📅 2025-05-29
🤖 AI Summary
Low-resource languages suffer from a critical scarcity of high-quality Word Sense Disambiguation (WSD) evaluation data, which severely hinders cross-lingual transfer research, especially for the Word-in-Context (WiC) task. To address this, we introduce the first fine-grained, multilingual WiC benchmark covering nine low-resource languages across diverse language families and writing systems. We propose a semi-automatic hybrid annotation framework that integrates human verification, context-aware prompting, and multilingual pretrained language models to enable scalable, high-accuracy sense annotation for low-resource languages. The dataset follows the standard WiC format and is publicly released. WiC-formatted transfer experiments on these languages highlight the importance of targeted dataset creation and evaluation for polysemy disambiguation, and the released data and code are intended as infrastructure for fair, robust, and truly multilingual NLP.

📝 Abstract
This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on suitable, high-quality benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
Problem

Research questions and friction points this paper is trying to address.

Creating high-quality evaluation datasets for low-resource languages
Developing semi-automatic annotation for polysemous word senses
Enhancing cross-lingual transfer for multilingual NLP tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid semi-automatic annotation method
Polysemous word sense-annotated datasets
WiC-formatted cross-lingual transfer evaluation
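
As a rough illustration of how sense-annotated sentences can be turned into WiC-style instances, the sketch below pairs sentences that share a target word and labels each pair by whether the two sense annotations match. The function name `build_wic_pairs`, the field names, the sense identifiers, and the English example sentences are all hypothetical and are not taken from the released datasets or code.

```python
from collections import defaultdict
from itertools import combinations


def build_wic_pairs(annotated):
    """Turn (lemma, sense_id, sentence) triples into WiC-style pairs.

    Hypothetical helper: a pair is labelled True when both sentences
    use the target word in the same annotated sense, else False.
    """
    by_lemma = defaultdict(list)
    for lemma, sense_id, sentence in annotated:
        by_lemma[lemma].append((sense_id, sentence))

    pairs = []
    for lemma, items in by_lemma.items():
        # Every unordered pair of sentences containing the same lemma.
        for (sense1, ctx1), (sense2, ctx2) in combinations(items, 2):
            pairs.append(
                {
                    "lemma": lemma,
                    "context1": ctx1,
                    "context2": ctx2,
                    "label": sense1 == sense2,
                }
            )
    return pairs


# Illustrative English data with made-up sense identifiers.
examples = [
    ("bank", "bank.finance", "She deposited the cheque at the bank."),
    ("bank", "bank.finance", "The bank raised its interest rates."),
    ("bank", "bank.river", "They had a picnic on the river bank."),
]

pairs = build_wic_pairs(examples)
for pair in pairs:
    print(pair["label"], "|", pair["context1"], "|", pair["context2"])
```

Pairing every combination grows quadratically with the number of sentences per word, so a real pipeline would likely sample pairs and balance the two labels; this sketch leaves that out.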
Roksana Goworek
Queen Mary University of London
Harpal Karlcut
Queen Mary University of London
Muhammad Shezad
Queen Mary University of London
Nijaguna Darshana
Queen Mary University of London
Abhishek Mane
Queen Mary University of London
Syam Bondada
Queen Mary University of London
R. Sikka
Queen Mary University of London
Ulvi Mammadov
Queen Mary University of London
Rauf Allahverdiyev
Queen Mary University of London
Sriram Purighella
Queen Mary University of London
Paridhi Gupta
Queen Mary University of London
M. Ndegwa
Queen Mary University of London
Haim Dubossarsky
Lecturer, Queen Mary University of London
Natural Language Processing, Computational Linguistics, Language Change