Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of safety-aligned large language models to short token injections at intermediate generation steps, a risk that alignment concentrated in the first few output tokens does not mitigate. The authors show that such shallow safety is a special case of a broader inference-time vulnerability, and that a model's alignment with refusal directions in its hidden states does not predict its robustness to injection. To address this, they align models directly on generation trajectories constructed by simulating mid-sequence perturbation. The reported results show improved robustness to mid-sequence injection and generalization to attacks that exploit early-token generation, supporting the argument that robust safety alignment requires training on the generation process itself, not only its outputs.

📝 Abstract

Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.
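
The idea of simulating mid-sequence perturbation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary layout, and the use of string tokens are all assumptions made for illustration. The sketch cuts a response at a random generation step, appends a short injected sequence, and pairs the perturbed prefix with a safe continuation as the training target. Step 0 places the injection before any generated token, which corresponds to the early-token (shallow safety) case.

```python
import random
from typing import Dict, List, Sequence, Tuple


def inject_mid_sequence(
    response_tokens: Sequence[str],
    injection_tokens: Sequence[str],
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Cut the response at a random step and append a short injection.

    Hypothetical helper: returns the perturbed prefix (tokens generated
    before the cut, followed by the injected tokens) and the cut step.
    """
    # Any step from 0 (before the first token) to len (after the last) is allowed.
    step = rng.randint(0, len(response_tokens))
    prefix = list(response_tokens[:step]) + list(injection_tokens)
    return prefix, step


def build_trajectory_example(
    prompt: str,
    response_tokens: Sequence[str],
    injection_tokens: Sequence[str],
    safe_continuation: Sequence[str],
    rng: random.Random,
) -> Dict[str, object]:
    """Assemble one training example (assumed layout, for illustration only)."""
    prefix, step = inject_mid_sequence(response_tokens, injection_tokens, rng)
    return {
        "prompt": prompt,
        "perturbed_prefix": prefix,   # what the model has "generated" so far
        "injection_step": step,       # where the perturbation was simulated
        "target": list(safe_continuation),  # behavior to train toward
    }
```

Sampling the step uniformly is one possible design choice; the abstract does not specify how injection positions are chosen, what the injected sequences contain, or which training objective is applied to the resulting trajectories.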

Problem

Research questions and friction points this paper is trying to address.

inference-time vulnerability

safety alignment

generation trajectories

token injection

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time vulnerability

generation trajectories

safety alignment