SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current safety evaluations of large language models focus primarily on whether a model refuses unsafe prompts, overlooking the concrete security risks that arise from action sequences executed in realistic programming environments. This work introduces the first operational safety benchmark tailored to stateful project workspaces. It assesses model safety by tracking action sequences and comparing final environment states, and it establishes a fine-grained safety taxonomy that distinguishes the root causes of violations. Experimental results show that even the best-performing model has a harmful safety-violation rate (HSR) above 54%, which suggests that existing alignment approaches are inadequate for real-world deployment. The benchmark also captures distinct safety behavior profiles across models, giving a more detailed picture of their operational reliability.

📝 Abstract

Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has more than a 54% harmful safety-violation rate (HSR), suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.
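To make the idea of state-based evaluation concrete, the following is a minimal sketch, not SABER's actual implementation. It assumes a workspace can be summarized as a mapping from file path to content hash, that a scenario names a set of protected paths, and that a violation rate is the fraction of episodes flagged as violations. The names `Snapshot`, `judge_episode`, `violation_rate`, `cause_profile`, and the cause labels are all hypothetical; the benchmark's real taxonomy and HSR definition are in the paper and repository.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Hypothetical workspace snapshot: relative file path -> content hash.
Snapshot = Dict[str, str]


@dataclass
class EpisodeResult:
    model: str
    violated: bool
    cause: Optional[str]  # hypothetical cause label, None when no violation


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, List[str]]:
    """List files deleted, added, and modified between two snapshots."""
    deleted = sorted(p for p in before if p not in after)
    added = sorted(p for p in after if p not in before)
    modified = sorted(p for p in before if p in after and before[p] != after[p])
    return {"deleted": deleted, "added": added, "modified": modified}


def judge_episode(model: str, before: Snapshot, after: Snapshot,
                  protected: Set[str]) -> EpisodeResult:
    """Judge an episode from its final state only, ignoring what the model said."""
    delta = diff_snapshots(before, after)
    if any(p in protected for p in delta["deleted"]):
        return EpisodeResult(model, True, "protected_file_deleted")
    if any(p in protected for p in delta["modified"]):
        return EpisodeResult(model, True, "protected_file_modified")
    return EpisodeResult(model, False, None)


def violation_rate(results: List[EpisodeResult]) -> float:
    """Fraction of episodes flagged as violations (assumed stand-in for HSR)."""
    if not results:
        return 0.0
    return sum(r.violated for r in results) / len(results)


def cause_profile(results: List[EpisodeResult]) -> Counter:
    """Count violations per cause label, giving a per-model safety profile."""
    return Counter(r.cause for r in results if r.violated)
```

In this sketch an episode that edits `src/app.py` but leaves a protected `.env` file untouched counts as safe, while one that deletes `.env` is flagged regardless of how the model described its actions. Grouping flagged episodes by cause is what allows two models with similar overall rates to show different profiles.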

Problem

Research questions and friction points this paper is trying to address.

operational safety

LLM coding agents

stateful workspaces

safety benchmarking

environment-aware safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

operational safety

stateful workspaces

coding agents