Should LLM Safety Be More Than Refusing Harmful Instructions?

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the safety behavior mismatch of large language models (LLMs) on long-tailed distributions—particularly cryptographic text—where conventional safety mechanisms fail. We propose a two-dimensional evaluation framework: “instruction refusal” and “generation safety.” Leveraging a cryptography-driven long-tailed test suite, we systematically assess pre-trained and post-trained models via adversarial prompt engineering, fine-grained response classification, and attribution analysis. Our findings reveal an implicit conflict between decryption capability and safety alignment: models either generate harmful content or over-refuse legitimate instructions; enhancing refusal alone neglects generation-layer risks. This study establishes the first empirically grounded, dual-dimensional evaluation paradigm for LLM safety, uncovering novel failure modes of safety mechanisms under long-tailed data. It provides a reproducible benchmark and theoretical insights for robust, safety-aware LLM design. (136 words)

Technology Category

Application Category

📝 Abstract
This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal-the ability to reject harmful obfuscated instructions, and (2) generation safety-the suppression of generating harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLM in long-tail text scenarios and provides directions for developing robust safety mechanisms.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM safety on long-tail encrypted texts
Assessing instruction refusal and generation safety dimensions
Investigating mismatched-generalization attacks on LLM safeguards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-dimensional framework for LLM safety assessment
Evaluation of safeguards against mismatched-generalization attacks
Analysis of LLM behavior on encrypted long-tail texts
🔎 Similar Papers
No similar papers found.