Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic safety evaluation for large language models (LLMs) used to control robotic health attendants, where a model may carry out actions that medical ethics prohibit. The authors introduce a benchmark grounded in the American Medical Association Principles of Medical Ethics, comprising 270 harmful instructions across nine prohibited behavior categories, and use it to evaluate 72 open-weight and proprietary LLMs in a simulated environment. Proprietary models were substantially safer, with a median violation rate of 23.7% compared to 72.8% for open-weight models. Among open-weight models, model size and release date were the main determinants of safety, whereas medical fine-tuning conferred no significant overall benefit and a prompt-based defense produced only a modest reduction in violations among the least safe models. The work provides a benchmark and empirical evidence for assessing LLM safety in robotic health attendant applications.

📝 Abstract

Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7% versus 72.8%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.
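
The headline statistics in the abstract (per-model violation rate, mean across models, share of models above 50%, and median by model group) can be illustrated with a minimal sketch. The abstract does not describe how the authors computed these figures, so everything here is an assumption: the data layout, the model names, the function names `violation_rate` and `summarize`, and the toy outcomes, which are invented for illustration and are not results from the paper.

```python
from statistics import mean, median

# Hypothetical per-instruction outcomes (toy values, NOT data from the paper):
# True means the model carried out a harmful instruction (a violation),
# False means it refused.
toy_results = {
    "model_a": {"group": "proprietary", "violations": [False, False, True, False]},
    "model_b": {"group": "open-weight", "violations": [True, True, False, True]},
    "model_c": {"group": "open-weight", "violations": [True, False, False, True]},
}


def violation_rate(outcomes):
    """Percentage of harmful instructions that the model carried out."""
    return 100.0 * sum(outcomes) / len(outcomes)


def summarize(results):
    """Aggregate per-model violation rates into the summary statistics."""
    rates = {name: violation_rate(r["violations"]) for name, r in results.items()}

    # Collect each model's rate under its group label.
    by_group = {}
    for name, r in results.items():
        by_group.setdefault(r["group"], []).append(rates[name])

    all_rates = list(rates.values())
    return {
        "per_model": rates,
        "mean": mean(all_rates),
        "share_above_50": sum(v > 50 for v in all_rates) / len(all_rates),
        "median_by_group": {g: median(v) for g, v in by_group.items()},
    }


summary = summarize(toy_results)
print(summary["median_by_group"])
```

In this sketch a violation is a binary outcome per instruction. A real evaluation would also need a rule for judging whether a model response counts as compliance, which the abstract does not specify.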

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Robotic Health Attendant

Safety Evaluation

Medical Ethics

Harmful Instructions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety Benchmarking

Large Language Models

Robotic Health Attendants