📝 Abstract
While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLM-generated responses are decomposed into individual, valid claims that are then verified. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm patients. However, existing factuality systems are a poor match for the medical domain: they are typically evaluated only on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, in contrast, are condition-dependent, conversational, hypothetical, structurally diverse, and subjective, which makes decomposition into valid facts challenging. We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods, reduces hallucination and vague references, and retains condition dependency in the facts. The resulting factuality score varies significantly with the decomposition method, the verification corpus, and the backbone LLM used, highlighting the importance of customizing each step for reliable factuality evaluation.