NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models to jailbreak attacks. Existing defenses often fail to balance safety and utility and frequently block legitimate but sensitive queries. The authors propose NeuroArmor, a white-box runtime defense that builds K prompt-specific safe variants of each input, checks the prompt's consistency with those variants in hidden-state space, and routes anomalous prompts to either a refusal branch (malicious prompts) or a helpful recovery branch (borderline benign prompts). Combining consistency checking with selective re-anchoring lets the defense intervene only where it is needed. On Llama-3-8B-Instruct, NeuroArmor reduces the attack success rate (ASR) from 41.56% to 1.57% while lowering the false positive rate (FPR) on benign requests from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off.
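
To put the reported figures in perspective, the short sketch below converts the before/after attack success rate and false positive rate into absolute drops (percentage points) and relative drops. The helper name `reduction` is hypothetical and used only for illustration; the four input numbers are the ones reported for Llama-3-8B-Instruct.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


# Reported figures: attack success rate and benign false positive rate.
asr_abs, asr_rel = reduction(41.56, 1.57)
fpr_abs, fpr_rel = reduction(30.26, 22.05)

print(f"ASR: -{asr_abs:.2f} points ({asr_rel:.1f}% relative)")
print(f"FPR: -{fpr_abs:.2f} points ({fpr_rel:.1f}% relative)")
```

Read this way, the attack success rate falls by about 40 percentage points (roughly a 96% relative reduction), while the benign false positive rate falls by about 8 points (roughly 27% relative), so the improvement on the safety side is much larger than on the over-blocking side.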

📝 Abstract

Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they often apply the same action to every prompt and therefore fail to balance safety and helpfulness. We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when intervention is needed and, once triggered, as safe targets for intervention. For each prompt, NeuroArmor builds K safe variants, compares the prompt state against this local safe reference in hidden-state space, and routes anomalies either to a refusal branch for malicious prompts or to a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces malicious attack success rate (ASR) from 41.56% to 1.57% while lowering benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to be operationally harmful. Overall, NeuroArmor provides a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.
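
The check-then-route idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the safe reference is aggregated, which consistency measure is used, or how the two branches are separated. The sketch assumes a centroid of the K safe-variant states as the local reference, cosine similarity as the consistency score, two hypothetical thresholds (`tau_pass`, `tau_refuse`) for routing, and a linear blend (`alpha`) as the re-anchoring step. Hidden states are stood in for by small hand-written vectors.

```python
import math
from typing import List, Tuple

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Mean of the K safe-variant states (assumed form of the local safe reference)."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as an assumed consistency score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def re_anchor(state: Vector, reference: Vector, alpha: float) -> Vector:
    """Pull the prompt state toward the safe reference (assumed linear blend)."""
    return [(1 - alpha) * s + alpha * r for s, r in zip(state, reference)]


def route(
    prompt_state: Vector,
    variant_states: List[Vector],
    tau_pass: float = 0.9,    # hypothetical threshold: consistent enough, no intervention
    tau_refuse: float = 0.5,  # hypothetical threshold: below this, treat as malicious
    alpha: float = 0.5,       # hypothetical re-anchoring strength
) -> Tuple[str, Vector]:
    """Return a branch label and the (possibly re-anchored) state."""
    reference = centroid(variant_states)
    score = cosine(prompt_state, reference)
    if score >= tau_pass:
        return "pass", prompt_state
    if score < tau_refuse:
        # The refusal itself would be produced by the caller; state is unchanged.
        return "refuse", prompt_state
    return "recover", re_anchor(prompt_state, reference, alpha)


# Toy stand-ins for hidden states of K = 3 safe variants.
variants = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]

for name, state in [("consistent", [1.0, 0.05]),
                    ("borderline", [1.0, 1.0]),
                    ("divergent", [0.0, 1.0])]:
    label, _ = route(state, variants)
    print(name, "->", label)
```

With these toy vectors the three prompts fall into the pass, recover, and refuse branches respectively. In a real white-box setting the vectors would come from a chosen layer of the model, and the thresholds would need calibration on held-out benign and malicious prompts.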

Problem

Research questions and friction points this paper is trying to address.

jailbreak defense

large language models

safety

helpfulness

adversarial attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

safe-variant-guided

representation consistency

selective re-anchoring