Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

📅 2026-06-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two critical failure modes in medical dialogue AI, premature diagnostic handoff and silent hallucination, by proposing a multi-agent framework that replaces LLM-based routing with deterministic orchestration. A neuro-symbolic state-tracking gate enforces the OLDCARTS clinical inquiry protocol by blocking the transition to diagnosis until every dimension has been collected, and a second gate uses semantic entropy, estimated from K=5 diagnostic samples, to quantify uncertainty and intercept divergent outputs before delivery. Evaluated on 150 test cases with simulated patient agents powered by Llama-3.1-70B-Instruct, the full system reaches 49.3% diagnostic precision, an 11.3-percentage-point absolute improvement over an unconstrained baseline. The results also show a statistically significant negative correlation between OLDCARTS completeness and semantic entropy (r = −0.181, p < 0.05), suggesting that structured clinical questioning is associated with lower model uncertainty.
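To make the state-tracking gate concrete, the following is a minimal sketch of how a deterministic OLDCARTS completeness check might route a conversation. It is not the paper's implementation: the names `IntakeState`, `record`, `missing`, `completeness`, and `next_step` are hypothetical, and defining completeness as the fraction of collected dimensions is an assumption made for illustration.

```python
from dataclasses import dataclass, field

# The eight OLDCARTS dimensions (aggravating/alleviating factors count as one).
OLDCARTS = (
    "onset", "location", "duration", "character",
    "aggravating_alleviating", "radiation", "timing", "severity",
)


@dataclass
class IntakeState:
    """Hypothetical symbolic state: which dimensions have been collected."""
    collected: dict = field(default_factory=dict)

    def record(self, dimension, value):
        """Store a patient answer; blank answers do not count as collected."""
        if dimension not in OLDCARTS:
            raise ValueError(f"unknown OLDCARTS dimension: {dimension}")
        if value and value.strip():
            self.collected[dimension] = value.strip()

    def missing(self):
        """Dimensions still to be asked, in protocol order."""
        return [d for d in OLDCARTS if d not in self.collected]

    def completeness(self):
        """Assumed definition: fraction of dimensions collected, in [0, 1]."""
        return len(self.collected) / len(OLDCARTS)


def next_step(state):
    """Deterministic routing: a rule, not an LLM, decides when to hand off."""
    gaps = state.missing()
    return ("ask", gaps[0]) if gaps else ("diagnose", None)
```

In this sketch the diagnostic transition is unreachable while any dimension is missing, which is the property that prevents a premature handoff; how the answers are extracted from free text is left out.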

📝 Abstract

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.
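The uncertainty gate can be illustrated with a short sketch of semantic entropy over K sampled diagnoses: samples are grouped into meaning clusters, and H is the Shannon entropy of the cluster frequencies. Several parts here are assumptions rather than details from the abstract: clustering by normalized string match stands in for a real semantic-equivalence check, the names `normalize`, `semantic_entropy`, and `uq_gate` are hypothetical, and the threshold of 0.7 nats is an arbitrary illustrative value.

```python
import math
from collections import Counter


def normalize(diagnosis):
    """Stand-in for semantic clustering: case- and whitespace-insensitive match."""
    return " ".join(diagnosis.lower().split())


def semantic_entropy(samples, cluster_key=normalize):
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    clusters = Counter(cluster_key(s) for s in samples)
    total = sum(clusters.values())
    return sum(-(n / total) * math.log(n / total) for n in clusters.values())


def uq_gate(samples, threshold=0.7):
    """Intercept the output when the samples diverge (threshold is hypothetical)."""
    h = semantic_entropy(samples)
    return ("intercept" if h > threshold else "deliver", h)
```

With K=5, the entropy ranges from 0 when all five samples agree to ln 5 (about 1.61 nats) when all five differ, so any usable threshold lies between those bounds.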

Problem

Research questions and friction points this paper is trying to address.

premature diagnostic handoff

silent hallucination

Agentic AI

clinical safety

medical reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic AI

neuro-symbolic state tracking

epistemic uncertainty quantification