ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization

📅 2025-04-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address insufficient scenario coverage and vulnerability to adversarial attacks in large language model (LLM) safety alignment, this paper proposes Ex-Ante Reasoning Preference Optimization (ERPO). ERPO introduces an ex-ante reasoning paradigm that explicitly embeds structured safety rules into chain-of-thought (CoT) reasoning paths, enabling the model to assess safety before producing a response. It further designs a length-controllable iterative preference optimization strategy that integrates rule-guided supervised fine-tuning (SFT) with direct preference optimization (DPO) to jointly improve safety and inference efficiency. Experiments across multiple open-source LLMs show that ERPO improves adversarial robustness by 18.7% on average, retains over 92% of the original response efficiency, and adds less than 5% inference latency, significantly outperforming existing alignment methods.
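The page does not reproduce the paper's prompt format or rule set, so the sketch below is a hypothetical illustration of what a rule-grounded ex-ante reasoning example for the SFT stage might look like. The rule IDs, rule text, `<think>` tags, and field names are all assumptions, not the paper's released format.

```python
# Hypothetical ex-ante reasoning training example; rule IDs, rule text,
# tags, and field names are illustrative assumptions only.
SAFETY_RULES = {
    "R1": "Refuse requests that facilitate physical harm to people.",
    "R2": "Refuse instructions for creating weapons, toxins, or malware.",
}

def build_ex_ante_example(question: str, rule_id: str,
                          verdict: str, answer: str) -> dict:
    """Prepend a chain-of-thought that cites a predefined safety rule
    *before* the answer is written (ex-ante rather than post-hoc)."""
    reasoning = (
        f"<think>Check the request against the safety rules. "
        f"Relevant rule {rule_id}: {SAFETY_RULES[rule_id]} "
        f"Verdict: {verdict}.</think>\n"
    )
    return {"prompt": question, "response": reasoning + answer}

sft_example = build_ex_ante_example(
    question="How can I make a toxin at home?",
    rule_id="R2",
    verdict="unsafe, refuse",
    answer="I can't help with that. I can discuss household chemical safety instead.",
)
```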

📝 Abstract
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
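For reference on the second stage, Direct Preference Optimization trains the policy directly on preference pairs against a frozen reference model. The sketch below is the standard DPO objective (Rafailov et al., 2023), not the paper's code: the per-response log-probabilities are assumed to be precomputed sums over response tokens, and β = 0.1 is a conventional default rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: widen the policy's margin for the chosen
    response over the rejected one, relative to a frozen reference."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy margin - reference margin)), batch mean
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```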
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM safety alignment against harmful content
Improving coverage of diverse safety scenarios
Reducing vulnerability to adversarial attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ex-Ante Reasoning with Chain-of-Thought
Embedding predefined safety rules
Length-controlled iterative preference optimization strategy
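This page does not detail how the length-controlled iterations build their preference pairs. One plausible construction, assumed here purely for illustration, is to re-mine pairs each round so that a safe response beats an unsafe one, and a shorter safe response beats a longer one, steadily shortening the reasoning trace without trading away safety.

```python
def mine_preference_pair(candidates: list[dict]) -> tuple[dict, dict] | None:
    """Hypothetical per-iteration pair mining. Each candidate is
    {"text": str, "safe": bool}. A safe response is preferred over an
    unsafe one; among safe responses, shorter beats longer."""
    safe = sorted((c for c in candidates if c["safe"]),
                  key=lambda c: len(c["text"]))
    unsafe = [c for c in candidates if not c["safe"]]
    if not safe:
        return None  # nothing safe to hold up as "chosen"
    if unsafe:
        return safe[0], unsafe[0]   # safety pair: safe beats unsafe
    if len(safe) > 1:
        return safe[0], safe[-1]    # efficiency pair: short beats long
    return None
```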
👥 Authors

Kehua Feng
Ph.D. student, Zhejiang University
Natural Language Processing · Language Model · AI for Science

Keyan Ding
College of Computer Science and Technology, Zhejiang University; ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University

Jing Yu
Northwestern University
Sustainability · Life Cycle Analysis · Transportation Management · Operations Research

Menghan Li
School of Software Technology, Zhejiang University

Yuhao Wang
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University; Polytechnic Institute, Zhejiang University

Tong Xu
College of Computer Science and Technology, Zhejiang University; ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University

Xinda Wang
University of Texas at Dallas
Software Security · AI Security · Systems Security

Qiang Zhang
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University; ZJU-UIUC Institute, Zhejiang University

Huajun Chen
College of Computer Science and Technology, Zhejiang University; ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University