DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the clinical risk of hallucination when large language models answer drug-related questions. It proposes DrugClaw, a multi-agent retrieval-augmented system that uses a reflection-driven state-machine workflow to query drug and pharmacovigilance sources and return traceable answers grounded in primary regulatory or peer-reviewed records. It also introduces DrugAudit, a 3,772-item authority-aware benchmark that scores source match, token-level snippet overlap, and citation faithfulness. DrugClaw ranks first on every column of the headline table, with a primary-source citation rate of 0.918 (+10.1 percentage points over the next-best system), faithfulness of 0.887 (+5.9 pp), and scores of 0.920 and 0.693 on the drug-related subsets of MedQA and PubMedQA.

📝 Abstract

Drug-information question answering is a high-stakes setting where hallucinated facts can mislead clinical decision-making and the provenance of each cited fact matters as much as the fact itself. We present DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills via a reflection-driven state-machine workflow and returns answers grounded in primary regulatory or peer-reviewed records. We also contribute DrugAudit, a 3,772-item authority-aware benchmark with an evaluation panel that scores upstream-of-gold source match, token-level semantic snippet overlap, and citation faithfulness under a dual-judge LLM-as-judge protocol with inter-judge kappa = 0.88 (almost-perfect). Across DrugAudit plus drug-related subsets of MedQA (751) and PubMedQA (512), DrugClaw is top-1 on every column of the headline table: composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, +10.1 pp over next-best), faithfulness (0.887, +5.9 pp), MedQA (0.920), and PubMedQA (0.693).
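The inter-judge agreement figure quoted in the abstract can be made concrete with a small sketch. It assumes the statistic is Cohen's kappa over categorical verdicts from two judges, which the abstract does not state explicitly; the function name `cohen_kappa` and the toy verdict lists are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the paper's "inter-judge kappa" is Cohen's kappa;
    the abstract does not say which kappa variant was used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy verdicts (1 = citation judged faithful, 0 = not); illustrative only.
judge_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
judge_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
kappa = cohen_kappa(judge_a, judge_b)
```

On the commonly used Landis and Koch scale, values from 0.81 to 1.00 are labelled "almost perfect", which matches the label the abstract attaches to 0.88.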

Problem

Research questions and friction points this paper is trying to address.

drug-information question answering

hallucination

provenance

primary-source grounding

authority-aware evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented generation

multi-agent system

primary-source grounding