🤖 AI Summary
Detecting hallucinations in NLP models under low-resource and unlabeled or weakly labeled settings remains challenging due to the scarcity of high-quality annotations. Method: This paper proposes a novel framework integrating few-shot optimization with structured data reformatting: (i) an iterative prompt engineering pipeline built upon DeepSeek to refine weak labels, and (ii) reformulation of raw data into instruction-tuning format. It further pioneers the adaptation of Mistral-7B-Instruct-v0.3 to the SHROOM task, enabling high-fidelity hallucination annotation under unsupervised or weakly supervised conditions, without reliance on large-scale annotated corpora. Contribution/Results: The approach significantly improves model generalizability and robustness in resource-constrained environments. Evaluated on the SHROOM SemEval-2024 test set, it achieves 85.5% accuracy, establishing a new state of the art. This work provides a scalable, deployment-friendly methodology for low-resource hallucination detection.
📝 Abstract
Hallucination detection in text generation remains a persistent challenge for natural language processing (NLP) systems, frequently resulting in unreliable outputs in applications such as machine translation and definition modeling. Existing methods struggle with data scarcity and the limitations of unlabeled datasets, as highlighted by the SHROOM shared task at SemEval-2024. In this work, we propose a novel framework to address these challenges, introducing DeepSeek few-shot optimization to enhance weak label generation through iterative prompt engineering. By restructuring the data to align with instruction-tuned generative models, we obtained high-quality annotations that considerably improved the performance of downstream models. We further fine-tuned the Mistral-7B-Instruct-v0.3 model on these optimized annotations, enabling it to accurately detect hallucinations in resource-limited settings. Combining this fine-tuned model with ensemble learning strategies, our approach achieved 85.5% accuracy on the test set, setting a new benchmark for the SHROOM task. This study demonstrates the effectiveness of data restructuring, few-shot optimization, and fine-tuning in building scalable and robust hallucination detection frameworks for resource-constrained NLP systems.
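As a rough illustration of the data-restructuring step, the sketch below converts one weakly labeled SHROOM-style example into an instruction-tuning record. This is a minimal assumption-laden sketch: the field names (`src`, `hyp`, `tgt`, `task`, `label`) follow the public SHROOM data layout, but the prompt wording and the `to_instruction_record` helper are hypothetical, not the paper's exact template.

```python
# Hypothetical sketch of the instruction-format restructuring described above.
# Field names mirror the public SHROOM data layout; the prompt wording is an
# assumption, not the paper's actual template.

def to_instruction_record(example: dict) -> dict:
    """Map one weakly labeled SHROOM example to an instruction/output pair."""
    prompt = (
        f"Task: {example['task']}\n"
        f"Source: {example['src']}\n"
        f"Reference: {example['tgt']}\n"
        f"Model output: {example['hyp']}\n"
        "Question: Does the model output contain a hallucination, i.e. "
        "content not supported by the source or reference? "
        "Answer 'Hallucination' or 'Not Hallucination'."
    )
    # The weak label (refined via few-shot prompting) becomes the target text.
    return {"instruction": prompt, "output": example["label"]}

example = {
    "task": "DM",  # definition modeling
    "src": "The cat sat on the <define> mat </define>.",
    "tgt": "A small piece of floor covering.",
    "hyp": "A large body of water.",
    "label": "Hallucination",  # weak label produced by the prompting pipeline
}
record = to_instruction_record(example)
```

Records in this shape can then be fed to a standard supervised fine-tuning loop for an instruction-tuned model such as Mistral-7B-Instruct-v0.3.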