Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of scalable, automated risk-assessment mechanisms for medical dialogue agents that recommend antidepressants, particularly their inability to account for variation in patient health literacy and behavioral patterns. The authors propose a multidimensional patient simulator grounded in real electronic health records. Guided by the NIST AI Risk Management Framework, the simulator systematically manipulates medical, linguistic, and behavioral features to generate diverse conversational scenarios for fine-grained evaluation of AI decision-support accuracy and safety. Using the All of Us dataset, health-literacy modeling, and a hybrid human-LLM evaluation protocol, simulations across 500 conversations show that decision-aid performance improves markedly with higher patient health literacy (rank-one concept retrieval accuracy rising from 47.9% to 81.6%) and that the LLM judge agrees strongly with human annotators (F1 = 0.94, κ = 0.78).
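The scenario grid described above (systematic combinations of linguistic and behavioral profiles) can be sketched as a simple cross product. This is an illustrative sketch only: the paper names three health-literacy levels (limited, functional, proficient) and three behavioral patterns (cooperative, distracted, adversarial); the remaining two linguistic profile labels below are hypothetical placeholders, not taken from the paper.

```python
from itertools import product

# Profile dimensions; "marginal" and "basic" are placeholder labels,
# since the paper only names the three literacy levels in its results.
linguistic = ["limited", "marginal", "basic", "functional", "proficient"]  # 5 linguistic profiles
behavioral = ["cooperative", "distracted", "adversarial"]                  # 3 behavioral patterns

# Every (linguistic, behavioral) pairing defines one scenario cell;
# multiple medical profiles per cell would then yield the 500 conversations.
scenarios = list(product(linguistic, behavioral))
print(len(scenarios))  # 15 profile combinations
```

Each cell of this 5 × 3 grid would be populated with EHR-derived medical profiles to reach the reported 500 simulated conversations.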

📝 Abstract
Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, κ=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic relationship between AI decision aid performance and patient health literacy: rank-one concept retrieval accuracy increased from 47.9% for limited health literacy to 69.1% for functional and 81.6% for proficient, i.e., performance degraded steadily toward the limited end of the spectrum.
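The agreement statistics above (F1 and Cohen's κ between the LLM judge and human annotators) follow standard definitions. A minimal stdlib sketch, assuming binary per-concept labels (1 = error flagged, 0 = not flagged); the function names and label encoding are illustrative, not from the paper:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def f1_score(gold, pred, positive=1):
    """F1 on the positive class for two binary label sequences."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: two annotators labeling four concepts.
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
print(round(cohens_kappa(gold, pred), 3))  # 0.5
print(round(f1_score(gold, pred), 3))      # 0.667
```

In the paper's protocol these would be computed over the 1,787 annotated medical concepts, with the paired bootstrap then testing whether the LLM judge's agreement differs from inter-annotator agreement.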
Problem

Research questions and friction points this paper is trying to address.

AI trustworthiness
conversational agents
risk assessment
antidepressant selection
patient simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

patient simulation
AI risk assessment
conversational agents
health literacy
NIST AI RMF