A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense

📅 2024-12-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the degradation of safety alignment in large language models (LLMs) under complex adversarial instructions—leading to increased harmful content generation—this paper proposes a multi-dimensional attack-defense alignment framework. Methodologically, it introduces a novel, multi-granularity adversarial instruction construction strategy covering semantic obfuscation, logical traps, and context-based induction, integrated with a high-fidelity safe-response generation mechanism and a safety-aware fine-tuning paradigm. We further develop SafeBench, a custom fine-grained safety evaluation benchmark, and conduct systematic validation on Llama3.2. Experiments demonstrate that our approach reduces harmful response rates by 42.6% under complex attacks while preserving general capabilities (e.g., MMLU, BBH), outperforming standard supervised fine-tuning (SFT), direct preference optimization (DPO), and existing safety alignment baselines. To our knowledge, this is the first work to achieve simultaneous, significant improvements in both robust defense capability and general task performance.

Technology Category

Application Category

📝 Abstract
Currently, large models are prone to generating harmful content when faced with complex attack instructions, significantly reducing their defensive capabilities. To address this issue, this paper proposes a method based on constructing data aligned with multi-dimensional attack defense to enhance the generative security of large models. The core of our method lies in improving the effectiveness of safe alignment learning for large models by innova-tively increasing the diversity of attack instruction dimensions and the accuracy of generat-ing safe responses. To validate the effectiveness of our method, beyond existing security evaluation benchmarks, we additionally designed new security evaluation benchmarks and conducted comparative experiments using Llama3.2 as the baseline model. The final ex-perimental results demonstrate that our method can significantly improve the generative security of large models under complex instructional attacks, while also maintaining and enhancing the models' general capabilities.
Problem

Research questions and friction points this paper is trying to address.

Large-scale models
Complex attack instructions
Security vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse Adversarial Training
Large Model Security Enhancement
Optimized Secure Response Generation
🔎 Similar Papers
No similar papers found.