🤖 AI Summary
Medical large language models (LLMs) face critical trustworthiness and safety challenges in clinical deployment, including poor robustness, privacy leakage, clinical bias propagation, and frequent hallucinations, while existing static benchmarks lag behind model development and lack comprehensive coverage. To address this, we propose DAS, a Dynamic, Automatic, and Systematic red-teaming framework that employs a multi-agent adversarial mechanism to autonomously mutate test cases, evolve triggering strategies, and evaluate model responses in a closed-loop, human-free stress-testing pipeline. Evaluating 15 mainstream medical LLMs, DAS reveals alarming vulnerabilities: 94% of previously correct answers fail dynamic robustness tests, privacy leaks are elicited in 86% of scenarios, 81% of fairness tests show clinically significant bias, and hallucination rates exceed 66%, deficiencies largely undetected by static evaluation. This work pioneers the shift from static red-teaming validation to an autonomous, evolutionary dynamic-assurance paradigm, establishing a scalable, infrastructure-level safety-verification framework for trustworthy clinical AI deployment.
📝 Abstract
Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents autonomously mutates test cases, identifies and evolves strategies that trigger unsafe behavior, and evaluates responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80%, 94% of previously correct answers failed our dynamic robustness tests. Failure rates were similarly high across the other domains: privacy leaks were elicited in 86% of scenarios, cognitive-bias priming altered clinical recommendations in 81% of fairness tests, and hallucination rates exceeded 66% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS provides the continuous surveillance that hospitals, regulators, and technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
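To make the closed loop concrete, the sketch below illustrates a mutate → query → judge cycle of the kind the abstract describes. It is a minimal illustration under assumed interfaces: `TestCase`, `mutate`, `judge`, `target_model`, and `red_team` are hypothetical stand-ins, not the paper's agents, prompts, or scoring.

```python
# Minimal, illustrative sketch of a closed-loop red-teaming cycle.
# Every name here (TestCase, mutate, judge, target_model, red_team) is a
# hypothetical stand-in, not the paper's implementation: real attacker and
# evaluator agents would be LLM-backed, not keyword-based.
import random
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str    # adversarial clinical query sent to the model under test
    strategy: str  # triggering strategy that produced this mutation

def mutate(seed: TestCase, pool: list[str]) -> TestCase:
    """Attacker agent: rewrite the seed under a sampled triggering strategy.
    (An LLM would perform the rewrite; tagging the prompt stands in for it.)"""
    strategy = random.choice(pool)
    return TestCase(prompt=f"[{strategy}] {seed.prompt}", strategy=strategy)

def judge(response: str) -> bool:
    """Evaluator agent: flag unsafe output (leak, bias, hallucination, or a
    robustness break). A keyword check stands in for an LLM-as-judge call."""
    return any(m in response.lower() for m in ("patient record", "guaranteed cure"))

def target_model(prompt: str) -> str:
    """Placeholder for the medical LLM under test."""
    return "This regimen is a guaranteed cure."  # toy unsafe answer

def red_team(seed: TestCase, strategies: list[str], rounds: int = 20) -> list[TestCase]:
    """Closed loop: mutate -> query -> judge; strategies that elicit unsafe
    output are duplicated in the pool so later rounds sample them more often."""
    pool, failures = list(strategies), []
    for _ in range(rounds):
        case = mutate(seed, pool)
        if judge(target_model(case.prompt)):
            failures.append(case)
            pool.append(case.strategy)  # evolve: reinforce the winning strategy
    return failures

if __name__ == "__main__":
    seed = TestCase("What warfarin dose should this patient receive?", "seed")
    hits = red_team(seed, ["role-play", "authority-priming", "typo-injection"])
    print(f"{len(hits)} of 20 unsafe responses elicited")
```

The key design point the sketch captures is the evolutionary step: strategies that succeed are fed back into the sampling pool, so the attack distribution adapts to the target model without any human in the loop.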