🤖 AI Summary
To address pervasive factual hallucinations in large language models (LLMs) applied to materials science, this work introduces HalluMatData, a domain-specific benchmark for hallucination evaluation, and proposes HalluMatDetector, a multi-stage hallucination detection framework. HalluMatDetector integrates intrinsic credibility analysis, multi-source knowledge-augmented retrieval, contradiction reasoning driven by graph neural networks, and multi-dimensional metric fusion. The work also introduces the Paraphrased Hallucination Consistency Score (PHCS), a metric that quantifies response inconsistency across semantically equivalent queries. Empirically, the authors uncover domain-specific hallucination entropy patterns across materials science subfields and establish a strong correlation between high-entropy queries and hallucination severity. Experiments show that the HalluMatDetector verification pipeline reduces hallucination rates by 30% compared to standard LLM outputs, while PHCS effectively characterizes response reliability, enabling robust, interpretable, and domain-aware hallucination assessment for materials science LLMs.
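The summary names four detection stages but not how their signals are combined. Below is a minimal, hypothetical sketch of the metric-fusion step, assuming each stage emits a hallucination score in [0, 1] that is merged by a weighted average against a decision threshold. The stage stubs, weights, and threshold are illustrative assumptions, not the paper's actual components.

```python
from typing import Callable

# A verification stage maps (query, response) to a hallucination score in [0, 1].
Stage = Callable[[str, str], float]

def fuse_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-stage scores. The weighting scheme is an
    assumption; the paper's actual fusion rule may differ."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total

def detect(query: str, response: str, stages: dict[str, Stage],
           weights: dict[str, float], threshold: float = 0.5) -> bool:
    """Run every stage and flag the response when the fused score
    crosses the threshold."""
    scores = {name: stage(query, response) for name, stage in stages.items()}
    return fuse_scores(scores, weights) > threshold

# Constant-score lambdas stand in for the real stages named in the summary.
stages: dict[str, Stage] = {
    "intrinsic": lambda q, r: 0.3,      # intrinsic credibility analysis
    "retrieval": lambda q, r: 0.6,      # multi-source knowledge-augmented retrieval
    "contradiction": lambda q, r: 0.8,  # GNN-driven contradiction reasoning
}
weights = {"intrinsic": 1.0, "retrieval": 2.0, "contradiction": 2.0}

flagged = detect("What is the band gap of GaN?", "GaN's band gap is 1.1 eV.",
                 stages, weights)
print(flagged)  # True -> fused score 0.62 exceeds the 0.5 threshold
```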
📝 Abstract
Artificial Intelligence (AI), particularly Large Language Models (LLMs), is transforming scientific discovery, enabling rapid knowledge generation and hypothesis formulation. However, a critical challenge is hallucination, where LLMs generate factually incorrect or misleading information, compromising research integrity. To address this, we introduce HalluMatData, a benchmark dataset for evaluating hallucination detection methods, factual consistency, and response robustness in AI-generated materials science content. Alongside it, we propose HalluMatDetector, a multi-stage hallucination detection framework that integrates intrinsic verification, multi-source retrieval, contradiction graph analysis, and metric-based assessment to detect and mitigate LLM hallucinations. Our findings reveal that hallucination levels vary significantly across materials science subdomains, with high-entropy queries exhibiting greater factual inconsistencies. By utilizing the HalluMatDetector verification pipeline, we reduce hallucination rates by 30% compared to standard LLM outputs. Furthermore, we introduce the Paraphrased Hallucination Consistency Score (PHCS) to quantify inconsistencies in LLM responses across semantically equivalent queries, offering deeper insight into model reliability.
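The abstract does not spell out how PHCS is computed. Below is a minimal sketch, assuming PHCS is the mean pairwise similarity of responses to semantically equivalent (paraphrased) queries, so that a low score flags paraphrase-sensitive, potentially hallucinated answers. Token-level Jaccard similarity is a lightweight stand-in for whatever embedding-based semantic similarity the paper actually uses.

```python
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity, a cheap stand-in for an
    embedding-based semantic similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def phcs(responses: list[str]) -> float:
    """Hypothetical PHCS: mean pairwise similarity of responses to
    paraphrased queries. Values near 1 indicate consistent answers;
    low values signal instability across paraphrases."""
    pairs = list(combinations(responses, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

# Responses to three paraphrases of the same materials question.
answers = [
    "GaN has a direct band gap of about 3.4 eV.",
    "The band gap of gallium nitride is roughly 3.4 eV and it is direct.",
    "GaN's band gap is 1.1 eV.",  # inconsistent outlier drags the score down
]
print(f"PHCS = {phcs(answers):.2f}")
```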