🤖 AI Summary
Large language models (LLMs) are susceptible to factual hallucinations induced by erroneous information in training data, undermining their reliability. Method: This paper systematically surveys factuality evaluation methodologies, addressing three core challenges: hallucination detection, limitations of existing benchmark datasets, and the reliability of evaluation metrics. We formulate five key research questions and propose a domain-customized fact-checking framework integrating instruction tuning, retrieval-augmented generation (RAG), multi-agent reasoning, and external knowledge integration. Enhanced interpretability and output consistency are achieved via advanced prompting strategies and domain-specific fine-tuning. Contribution/Results: Empirical results demonstrate that evidence-aligned evaluation—leveraging external verifiable sources—significantly outperforms purely autoregressive metrics in hallucination mitigation. The proposed framework advances the development of high-fidelity, context-aware, and domain-adapted trustworthy language models.
📝 Abstract
Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.