HalluMat: Detecting Hallucinations in LLM-Generated Materials Science Content Through Multi-Stage Verification

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address pervasive factual hallucinations in large language models (LLMs) applied to materials science, this work introduces HalluMatData, the first domain-specific benchmark for hallucination evaluation in this field, and proposes HalluMatDetector, a multi-stage hallucination detection framework. HalluMatDetector integrates intrinsic credibility analysis, multi-source knowledge-augmented retrieval, graph neural network-based contradiction reasoning, and multi-dimensional metric fusion. The work further proposes the Paraphrased Hallucination Consistency Score (PHCS) to quantify response inconsistency across semantically equivalent queries, uncovers subfield-specific hallucination entropy patterns in materials science, and establishes a strong correlation between high-entropy queries and hallucination severity. Experimental results show that the HalluMatDetector pipeline reduces hallucination rates by 30% compared with standard LLM outputs, while PHCS effectively characterizes response reliability, enabling robust, interpretable, and domain-aware hallucination assessment for materials science LLMs.
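
The multi-stage design described above can be pictured as a sequence of scoring stages whose outputs are fused into a single hallucination score. The sketch below is a minimal, hypothetical illustration of that structure; every function name, weight, and heuristic in it is an assumption for exposition, not the authors' implementation (in particular, the paper's contradiction stage uses a graph neural network, which is stubbed out here).

# Hypothetical sketch of a multi-stage hallucination check in the spirit of
# HalluMatDetector. All names, weights, and heuristics are illustrative
# assumptions, not the authors' actual API or method.
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str
    score: float  # 0 = no hallucination signal, 1 = strong hallucination signal


def intrinsic_credibility(claim: str) -> StageResult:
    # Placeholder for intrinsic verification (e.g., model self-consistency
    # or token-level confidence); returns a fixed value here.
    return StageResult("intrinsic", 0.2)


def retrieval_support(claim: str, corpus: list[str]) -> StageResult:
    # Placeholder: fraction of retrieved passages that do not mention the claim.
    unsupported = sum(claim.lower() not in doc.lower() for doc in corpus)
    return StageResult("retrieval", unsupported / max(len(corpus), 1))


def contradiction_signal(claim: str, corpus: list[str]) -> StageResult:
    # Stub for graph-based contradiction reasoning between the claim and
    # retrieved evidence (a GNN in the paper; a trivial constant here).
    return StageResult("contradiction", 0.0)


def fuse(stages: list[StageResult], weights: dict[str, float]) -> float:
    # Weighted fusion of per-stage scores into one hallucination score.
    total = sum(weights.get(s.name, 0.0) for s in stages)
    return sum(weights.get(s.name, 0.0) * s.score for s in stages) / max(total, 1e-9)


if __name__ == "__main__":
    corpus = [
        "Graphene has a high thermal conductivity.",
        "Silicon is a semiconductor.",
    ]
    claim = "Graphene is an electrical insulator."
    stages = [
        intrinsic_credibility(claim),
        retrieval_support(claim, corpus),
        contradiction_signal(claim, corpus),
    ]
    score = fuse(stages, {"intrinsic": 0.3, "retrieval": 0.4, "contradiction": 0.3})
    print(f"hallucination score: {score:.2f}")  # higher = more likely hallucinated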

📝 Abstract
Artificial Intelligence (AI), particularly Large Language Models (LLMs), is transforming scientific discovery, enabling rapid knowledge generation and hypothesis formulation. However, a critical challenge is hallucination, where LLMs generate factually incorrect or misleading information, compromising research integrity. To address this, we introduce HalluMatData, a benchmark dataset for evaluating hallucination detection methods, factual consistency, and response robustness in AI-generated materials science content. Alongside this, we propose HalluMatDetector, a multi-stage hallucination detection framework that integrates intrinsic verification, multi-source retrieval, contradiction graph analysis, and metric-based assessment to detect and mitigate LLM hallucinations. Our findings reveal that hallucination levels vary significantly across materials science subdomains, with high-entropy queries exhibiting greater factual inconsistencies. By utilizing the HalluMatDetector verification pipeline, we reduce hallucination rates by 30% compared to standard LLM outputs. Furthermore, we introduce the Paraphrased Hallucination Consistency Score (PHCS) to quantify inconsistencies in LLM responses across semantically equivalent queries, offering deeper insights into model reliability.
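
The abstract describes PHCS as a measure of inconsistency in LLM answers across semantically equivalent (paraphrased) queries, but the exact formula is not reproduced here. A minimal sketch, assuming a simple mean pairwise token-overlap similarity rather than the paper's actual scoring, could look like this:

# Illustrative paraphrase-consistency score in the spirit of PHCS.
# The Jaccard-based similarity below is an assumed stand-in, not the
# paper's definition of PHCS.
from itertools import combinations


def token_jaccard(a: str, b: str) -> float:
    # Token-set overlap between two answers (1.0 = identical token sets).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def paraphrase_consistency(answers: list[str]) -> float:
    """Mean pairwise similarity of answers to paraphrases of one query.

    Values near 1.0 indicate consistent answers; values near 0 suggest the
    model changes its factual claims when the question is merely rephrased.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Answers an LLM might give to three paraphrases of the same question.
    answers = [
        "The band gap of anatase TiO2 is about 3.2 eV.",
        "Anatase TiO2 has a band gap of roughly 3.2 eV.",
        "Anatase TiO2 is a metal with no band gap.",  # inconsistent answer
    ]
    print(f"consistency: {paraphrase_consistency(answers):.2f}")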
Problem

Research questions and friction points this paper is trying to address.

Detect hallucinations in LLM-generated materials science content
Evaluate factual consistency and robustness in AI-generated scientific text
Mitigate misleading information to improve research integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage verification framework for hallucination detection
Paraphrased Hallucination Consistency Score for quantifying inconsistencies
Benchmark dataset for evaluating materials science content hallucinations
Bhanu Prakash Vangala
Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
Sajid Mahmud
Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
Pawan Neupane
Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
Joel Selvaraj
Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
Jianlin Cheng
Curators' Distinguished Professor, University of Missouri
Bioinformatics, Machine Learning, Deep Learning, Artificial Intelligence