🤖 AI Summary
To address the degradation of safety alignment in large language models (LLMs) under complex adversarial instructions—leading to increased harmful content generation—this paper proposes a multi-dimensional attack-defense alignment framework. Methodologically, it introduces a novel, multi-granularity adversarial instruction construction strategy covering semantic obfuscation, logical traps, and context-based induction, integrated with a high-fidelity safe-response generation mechanism and a safety-aware fine-tuning paradigm. We further develop SafeBench, a custom fine-grained safety evaluation benchmark, and conduct systematic validation on Llama3.2. Experiments demonstrate that our approach reduces harmful response rates by 42.6% under complex attacks while preserving general capabilities (e.g., MMLU, BBH), outperforming standard supervised fine-tuning (SFT), direct preference optimization (DPO), and existing safety alignment baselines. To our knowledge, this is the first work to achieve simultaneous, significant improvements in both robust defense capability and general task performance.
📝 Abstract
Currently, large models are prone to generating harmful content when faced with complex attack instructions, significantly reducing their defensive capabilities. To address this issue, this paper proposes a method based on constructing data aligned with multi-dimensional attack defense to enhance the generative security of large models. The core of our method lies in improving the effectiveness of safe alignment learning for large models by innova-tively increasing the diversity of attack instruction dimensions and the accuracy of generat-ing safe responses. To validate the effectiveness of our method, beyond existing security evaluation benchmarks, we additionally designed new security evaluation benchmarks and conducted comparative experiments using Llama3.2 as the baseline model. The final ex-perimental results demonstrate that our method can significantly improve the generative security of large models under complex instructional attacks, while also maintaining and enhancing the models' general capabilities.