🤖 AI Summary
This study addresses hallucination detection in large language model (LLM)-generated content within multilingual scientific texts, particularly under low-resource and zero-shot language settings. We propose a data-centric approach: unifying, cleaning, and balancing five existing datasets—increasing the training sample size 172×—and demonstrate that high-quality data construction substantially outperforms architectural modifications alone. Leveraging fine-tuned XLM-RoBERTa-Large, our method enables fine-grained hallucination detection across nine languages. On the SHROOM-CAP 2025 benchmark, it achieves a Factuality F1 of 0.5107 in zero-shot Gujarati, ranking second, and places within the top six in the remaining eight languages. Our core contribution is the empirical validation that meticulously curated multilingual data is decisive for hallucination detection performance, establishing a scalable paradigm for enhancing AI trustworthiness in low-resource languages.
📝 Abstract
The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopt a data-centric strategy that addresses the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach fine-tunes XLM-RoBERTa-Large (560 million parameters) on this enhanced dataset and achieves competitive performance across all languages, including 2nd place in Gujarati (a zero-shot language) with a Factuality F1 of 0.5107, and rankings between 4th and 6th place in the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.
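The unify-and-balance step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `text`/`label` schema, the label names `correct`/`hallucinated`, and the downsampling strategy are all assumptions made for the example.

```python
import random

def balance_binary(samples, seed=0):
    """Downsample the majority class so the corpus is split 50/50
    between 'correct' and 'hallucinated' labels.
    (Label names and schema are illustrative assumptions.)"""
    rng = random.Random(seed)
    pos = [s for s in samples if s["label"] == "hallucinated"]
    neg = [s for s in samples if s["label"] == "correct"]
    n = min(len(pos), len(neg))  # size of the minority class
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

# Unify several source datasets into one corpus before balancing
# (two toy, imbalanced datasets stand in for the five real ones).
corpus = []
for dataset in ([{"text": "a", "label": "correct"}] * 30,
                [{"text": "b", "label": "hallucinated"}] * 10):
    corpus.extend(dataset)

balanced = balance_binary(corpus)  # 20 samples, 10 per label
```

The balanced corpus would then be fed to a standard sequence-classification fine-tuning loop for XLM-RoBERTa-Large; oversampling the minority class is an equally plausible alternative to the downsampling shown here.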