Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the scarcity of research on Citation Needed Detection (CND) for lower-resource languages on Wikipedia, where recent automated fact-checking pipelines rely on large language models (LLMs) that low-resource organizations often cannot access. To bridge this gap, the authors introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, and fine-tune small decoder-based language models (SLMs) with an encoder-style objective. In their experiments, the fine-tuned SLMs substantially outperform prompted LLMs across languages. In the cross-lingual setting, SLMs fine-tuned solely on English claims surpass LLMs even with little to no target-language adaptation.

📝 Abstract

In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn
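
The abstract does not spell out what the "encoder-style objective" looks like. One common reading, assumed here and not confirmed by the source, is that a decoder's hidden state is pooled and passed through a classification head trained with cross-entropy, rather than asking the model to generate a label as text. The sketch below illustrates that reading on toy numbers in plain Python; the function names, the last-token pooling choice, and the weights are all hypothetical and are not taken from the MCN code.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden vector of the last non-padding token.

    Assumption: last-token pooling, a common choice for decoder models.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

def classify(pooled, weights, bias):
    """Linear head plus sigmoid: estimated P(citation needed | claim)."""
    logit = sum(w * h for w, h in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def bce_loss(prob, label):
    """Binary cross-entropy for a single example (label is 0 or 1)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps)
             + (1 - label) * math.log(1 - prob + eps))

# Toy input: three token positions (the last one is padding), hidden size 2.
hidden = [[0.1, 0.2], [0.5, -0.3], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = last_token_pool(hidden, mask)            # vector at position 1
prob = classify(pooled, weights=[1.0, 1.0], bias=0.0)
loss = bce_loss(prob, label=1)                    # label 1 = citation needed
```

In a real setup the hidden states would come from the SLM and the head weights would be learned by minimizing this loss over labeled claims; the toy values only show how the pieces connect.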

Problem

Research questions and friction points this paper is trying to address.

Citation Needed Detection

Low-Resource Languages

Multilingual

Cross-Lingual

Automated Fact-Checking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Citation Needed Detection

Small Language Models

Multilingual