LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

📅 2025-09-24
🤖 AI Summary
Ensuring robust, safety-aligned behavior in large language models (LLMs) without compromising practical utility remains challenging. Method: We propose LatentGuard—a three-stage framework integrating behavioral alignment with fine-grained, controllable steering in the latent space. It employs a structured VAE with multi-label supervised learning to disentangle latent representations for precise identification of attack types, tactics, and benign intents; augments LLM fine-tuning with reasoning-enhanced data; and implements semantically interpretable safety interventions on intermediate MLP activations. Results: On Qwen3-8B, LatentGuard significantly improves both refusal reliability on adversarial inputs and helpfulness on benign queries, while enhancing decision interpretability. Cross-architecture validation on Mistral-7B confirms strong generalizability. Contribution: This work introduces the first unified approach that jointly leverages multi-label structured latent-space modeling and behavior-level safety control—achieving a principled balance among security, controllability, and practicality.
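The core of the second stage can be illustrated in miniature. The sketch below (not the paper's code; the linear encoder/decoder, dimensions, and loss weighting are simplifying assumptions) shows a single structured-VAE training step on an intermediate MLP activation: a reparameterized latent code is scored by both a reconstruction/KL objective and a multi-label head supervised with attack-type, attack-method, and benign-indicator annotations, which is what pushes the latent dimensions toward disentangled, interpretable factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, d_lat, n_labels = 16, 4, 3   # activation dim, latent dim, label count

# Toy parameters (real models would learn these by gradient descent).
W_enc = rng.normal(scale=0.1, size=(d_act, 2 * d_lat))  # -> [mu | log_var]
W_dec = rng.normal(scale=0.1, size=(d_lat, d_act))      # latent -> activation
W_cls = rng.normal(scale=0.1, size=(d_lat, n_labels))   # multi-label head

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def structured_vae_loss(h, y):
    """h: intermediate MLP activation; y: {0,1} multi-label annotations."""
    stats = h @ W_enc
    mu, log_var = stats[:d_lat], stats[d_lat:]
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=d_lat)  # reparameterize
    recon_loss = np.mean((z @ W_dec - h) ** 2)               # reconstruction
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    p = sigmoid(z @ W_cls)                                   # label probs
    bce = -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
    return recon_loss + kl + bce   # supervision ties latents to labels

h = rng.normal(size=d_act)         # stand-in for a captured MLP activation
y = np.array([1.0, 0.0, 0.0])      # e.g. one attack-type label active
loss = structured_vae_loss(h, y)
```

The multi-label term is what distinguishes this from a plain VAE: each annotated factor gets its own supervised pressure on the latent code, so individual dimensions become attributable to attack types, tactics, or benign intent.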

📝 Abstract
Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LATENTGUARD, a novel three-stage framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach begins by fine-tuning an LLM on rationalized datasets containing both reasoning-enhanced refusal responses to adversarial prompts and reasoning-enhanced normal responses to benign queries, establishing robust behavioral priors across both safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This supervision enables the VAE to learn disentangled latent representations that capture distinct adversarial characteristics while maintaining semantic interpretability. Through targeted manipulation of learned latent dimensions, LATENTGUARD achieves selective refusal behavior, effectively blocking harmful requests while preserving helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate significant improvements in both safety controllability and response interpretability without compromising utility. Cross-architecture validation on Mistral-7B confirms the generalizability of our latent steering approach, showing consistent effectiveness across different model families. Our results suggest that structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.
Problem

Research questions and friction points this paper is trying to address.

Balancing robust safety alignment with utility preservation in LLMs
Achieving fine-grained controllability at the representation level for safety
Enabling selective refusal of harmful requests while maintaining helpful responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage framework with behavioral alignment
Supervised VAE for disentangled latent representations
Targeted latent manipulation for selective refusal
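The third bullet, targeted latent manipulation, can be sketched as a small activation edit. This is an illustrative toy (the mean-only encoder, the specific latent index, and the steering strength `alpha` are all hypothetical, not taken from the paper): an intermediate activation is encoded into the learned latent space, one interpretable dimension is amplified, and only the resulting change is decoded back into activation space.

```python
import numpy as np

rng = np.random.default_rng(1)
d_act, d_lat = 16, 4
W_enc = rng.normal(scale=0.1, size=(d_act, d_lat))  # mean encoder (toy)
W_dec = rng.normal(scale=0.1, size=(d_lat, d_act))  # latent -> activation

REFUSAL_DIM = 2   # hypothetical latent dimension tied to refusal behavior

def steer(h, dim=REFUSAL_DIM, alpha=3.0):
    """Encode an MLP activation, shift one interpretable latent
    dimension, and write only that edit back into activation space."""
    z = h @ W_enc
    z_edit = z.copy()
    z_edit[dim] += alpha                  # amplify the refusal direction
    return h + (z_edit - z) @ W_dec       # apply the decoded delta

h = rng.normal(size=d_act)                # stand-in MLP activation
h_steered = steer(h)
```

In deployment such an edit would run inside the forward pass (e.g. via an activation hook at the chosen MLP layer); because the shift targets a single supervised latent dimension, the intervention stays semantically attributable, which is the selectivity the framework aims for.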