When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a dual instability in LLM agents operating over long contexts: both task performance and safety refusal behavior degrade unpredictably as context length increases. Task accuracy drops by more than 50% at 100K tokens, even for models advertising 1M-2M token windows, while refusal rates for harmful queries shift sharply and non-monotonically: at 200K tokens, GPT-4.1-nano rises from roughly 5% to 40%, whereas Grok 4 Fast falls from roughly 80% to 10%. To investigate, the authors design a scalable agent framework supporting million-token contexts and native tool integration, enabling systematic evaluation of context length, content type, and positional effects. The analysis shows, for the first time, that safety refusals degrade non-monotonically as the context expands, exposing flaws in existing multi-step evaluation paradigms. These findings provide empirical grounding for long-context safety mechanisms and introduce a new evaluation dimension centered on the contextual stability of safety alignment.
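A simple way to reproduce the headline measurement is to pad a fixed request with benign filler text and track how often the model refuses at each padded length. The sketch below illustrates that sweep; it is not the paper's harness: `query_model` and `is_refusal` are hypothetical placeholders for a chat-completion call and a refusal classifier, and the filler text, token approximation, and lengths are illustrative.

```python
# Hypothetical sketch of a refusal-rate sweep over context lengths.
# `query_model` and `is_refusal` are placeholders, not the paper's code.

from typing import Callable, List


def build_prompt(request: str, filler_tokens: int) -> str:
    """Pad the request with benign filler to approximate a target context length."""
    sentence = "Background note: this paragraph is routine, unrelated documentation. "
    target_chars = filler_tokens * 4  # rough heuristic: ~4 characters per token
    filler = (sentence * (target_chars // len(sentence) + 1))[:target_chars]
    return f"{filler}\n\nUser request: {request}"


def refusal_rate(
    query_model: Callable[[str], str],   # e.g. a chat-completions wrapper
    is_refusal: Callable[[str], bool],   # e.g. a keyword or classifier check
    harmful_requests: List[str],
    filler_tokens: int,
) -> float:
    """Fraction of harmful requests refused at one padded context length."""
    refusals = sum(
        is_refusal(query_model(build_prompt(r, filler_tokens)))
        for r in harmful_requests
    )
    return refusals / len(harmful_requests)


# Example sweep (token counts mirror those discussed in the paper):
# for n in (1_000, 100_000, 200_000):
#     print(n, refusal_rate(query_model, is_refusal, harmful_requests, n))
```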

📝 Abstract
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from ~5% to ~40% while Grok 4 Fast decreases from ~80% to ~10% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
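The abstract also points to sensitivity to where content sits within the context, not just how much of it there is. Under the same assumptions as before (hypothetical `query_model` and `is_refusal` helpers, illustrative filler text), a positional probe might look like this:

```python
# Hypothetical sketch: vary where the request sits inside a long context,
# holding total length fixed. The helpers are illustrative placeholders.

from typing import Callable, Dict, List


def place_request(request: str, filler: str, position: str) -> str:
    """Insert the request at the start, middle, or end of the filler context."""
    if position == "start":
        return f"{request}\n\n{filler}"
    if position == "middle":
        half = len(filler) // 2
        return f"{filler[:half]}\n\n{request}\n\n{filler[half:]}"
    return f"{filler}\n\n{request}"  # default: end


def positional_refusal_rates(
    query_model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    requests: List[str],
    filler: str,
) -> Dict[str, float]:
    """Refusal rate per placement, so positional effects are isolated from length."""
    rates: Dict[str, float] = {}
    for position in ("start", "middle", "end"):
        refusals = sum(
            is_refusal(query_model(place_request(r, filler, position)))
            for r in requests
        )
        rates[position] = refusals / len(requests)
    return rates
```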
Problem

Research questions and friction points this paper is trying to address.

Evaluates safety and performance of long-context LLM agents
Identifies unstable refusal behaviors in extended contexts
Reveals degradation in both benign and harmful task handling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating LLM agent safety in long-context scenarios
Testing refusal rate changes with extended token windows
Analyzing performance degradation in multi-step, tool-calling agent tasks (a minimal agent-loop sketch follows this list)
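Because the evaluation targets agents rather than single prompts, the long context is accumulated through tool calls inside an agent loop. Below is a minimal sketch of such a loop, assuming an OpenAI-style chat-completions client with tool calling; the single `read_file` tool, the model name, and the step limit are illustrative assumptions, not the paper's actual framework.

```python
# Minimal tool-calling agent loop, sketched against an OpenAI-style
# chat-completions API. Tool set, model name, and step limit are
# illustrative, not the paper's framework.

import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file from the local workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]


def read_file(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def run_agent(task: str, model: str = "gpt-4.1-nano", max_steps: int = 10) -> str:
    """Loop until the model answers (or refuses) without requesting a tool."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""   # final answer, possibly a refusal
        messages.append(msg)           # keep the assistant's tool request in context
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = read_file(**args) if call.function.name == "read_file" else ""
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,     # tool output grows the context each step
            })
    return "max steps reached"
```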
👥 Authors
Tsimur Hadeliya (Algoverse)
Mohammad Ali Jauhar (Algoverse)
Nidhi Sakpal (Algoverse, New Jersey Institute of Technology)
Diogo Cruz (PhD Student, Instituto Superior Técnico; interests: quantum computing, quantum information, quantum algorithms)