Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?

📅 2025-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the risk that large language models (LLMs) generate harmful outputs when exposed to out-of-distribution unsafe inputs. The authors propose a lightweight, plug-and-play two-stage safety mechanism: first, a small classifier trained solely on last-token hidden states from intermediate LLM layers detects risky inputs; second, a dedicated safety model is invoked to generate compliant responses for flagged prompts. Crucially, they provide the first empirical demonstration that robust safety semantics are consistently encoded across intermediate layers of mainstream LLMs, enabling effective detection without fine-tuning the base model. Evaluated on multiple adversarial benchmarks, the method achieves a >98% malicious-prompt detection rate and a <0.5% false-positive rate on benign inputs, substantially outperforming existing baselines. The approach establishes a transferable, low-overhead paradigm for LLM safety alignment that requires no architectural modification or retraining of the primary LLM.
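The first stage described above, a lightweight probe over last-token hidden states, can be sketched as follows. This is an illustrative sketch only: the Gaussian vectors stand in for real hidden states (which would come from a forward pass of the base LLM), and the logistic-regression probe is our assumption of a "lightweight classifier", not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for last-token hidden states from an intermediate
# layer: "safe" and "unsafe" prompts drawn from two shifted Gaussians.
d = 64
safe = rng.normal(loc=-0.5, scale=1.0, size=(200, d))
unsafe = rng.normal(loc=+0.5, scale=1.0, size=(200, d))
X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Lightweight probe: logistic regression fit by plain gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
    w -= lr * (X.T @ (p - y) / len(y))      # gradient step on weights
    b -= lr * float(np.mean(p - y))         # gradient step on bias

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = float(np.mean(pred == y))
```

Because the probe is a single linear layer over a frozen representation, training it adds negligible cost compared to fine-tuning the base model.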

📝 Abstract
Ensuring the safety of Large Language Models (LLMs) is critical, but currently used methods in most cases sacrifice model performance to obtain increased safety, or perform poorly on data outside of their adaptation distribution. We investigate existing methods for such generalization and find them insufficient. Surprisingly, while even plain LLMs recognize unsafe prompts, they may still generate unsafe responses. To avoid performance degradation while preserving safety, we advocate a two-step framework: we first identify unsafe prompts via a lightweight classifier, and apply a "safe" model only to such prompts. In particular, we explore the design of the safety detector in more detail, investigating the use of different classifier architectures and prompting techniques. Interestingly, we find that the final hidden state for the last token is enough to provide robust performance, minimizing false positives on benign data while performing well on malicious prompt detection. Additionally, we show that classifiers trained on representations from different model layers perform comparably to those trained on the latest model layers, indicating that safety representations are present in the LLMs' hidden states at most model stages. Our work is a step towards efficient, representation-based safety mechanisms for LLMs.
Problem

Research questions and friction points this paper is trying to address.

Can safety be improved without degrading LLM performance?
Can unsafe prompts be detected with lightweight classifiers?
Where is safety information represented in LLM hidden states?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-step framework separating unsafe-prompt detection from safe generation
Lightweight classifier over hidden states for unsafe-prompt detection
Last-token hidden state suffices for robust detection across most layers
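The two-step framework above can be sketched as a simple routing policy. All names here (`detect_unsafe`, `base_model`, `safety_model`) are hypothetical stand-ins for illustration; in the paper, the detector would be the hidden-state probe, not the trivial keyword check used here.

```python
def detect_unsafe(prompt: str) -> bool:
    # Stand-in for the lightweight probe over LLM hidden states;
    # a trivial keyword check is used here for illustration only.
    return "attack" in prompt.lower()

def base_model(prompt: str) -> str:
    # Stand-in for the unmodified base LLM.
    return f"[base model answer to: {prompt}]"

def safety_model(prompt: str) -> str:
    # Stand-in for the dedicated "safe" model applied only to flagged prompts.
    return "I can't help with that request."

def respond(prompt: str) -> str:
    # Route flagged prompts to the safety model; everything else goes to
    # the base LLM, so benign traffic sees no performance degradation.
    return safety_model(prompt) if detect_unsafe(prompt) else base_model(prompt)
```

The design point is that the base model is never fine-tuned or modified; only flagged prompts pay the cost of the safety path.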