Should LLM Safety Be More Than Refusing Harmful Instructions?

📅 2025-06-03

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

This work addresses the safety behavior mismatch of large language models (LLMs) on long-tailed distributions—particularly cryptographic text—where conventional safety mechanisms fail. We propose a two-dimensional evaluation framework: “instruction refusal” and “generation safety.” Leveraging a cryptography-driven long-tailed test suite, we systematically assess pre-trained and post-trained models via adversarial prompt engineering, fine-grained response classification, and attribution analysis. Our findings reveal an implicit conflict between decryption capability and safety alignment: models either generate harmful content or over-refuse legitimate instructions; enhancing refusal alone neglects generation-layer risks. This study establishes the first empirically grounded, dual-dimensional evaluation paradigm for LLM safety, uncovering novel failure modes of safety mechanisms under long-tailed data. It provides a reproducible benchmark and theoretical insights for robust, safety-aware LLM design. (136 words)

Technology Category

Application Category

📝 Abstract

This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal-the ability to reject harmful obfuscated instructions, and (2) generation safety-the suppression of generating harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLM in long-tail text scenarios and provides directions for developing robust safety mechanisms.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM safety on long-tail encrypted texts

Assessing instruction refusal and generation safety dimensions

Investigating mismatched-generalization attacks on LLM safeguards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-dimensional framework for LLM safety assessment

Evaluation of safeguards against mismatched-generalization attacks

Analysis of LLM behavior on encrypted long-tail texts

🔎 Similar Papers

No similar papers found.