Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

📅 2026-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the erosion of safety alignment that can occur when safety-aligned large language models are fine-tuned on downstream task data. The authors propose DualSelect, a framework that couples the selection of task samples with the selection of safety reference examples. Under a minimax view, DualSelect refreshes task-conditioned safety references, favoring those with high preservation loss and task conflict, and then keeps the task samples that are compatible with the induced reference direction. The method combines entropy-regularized scoring surrogates, lazy reference refresh, and gradient correction to make the coupled selection efficient. On 1B–8B parameter models, DualSelect preserves safety without losing task utility: with the REDORCA judge, it improves Safety Avg. over the strongest baseline by at least 5.10 points, and it remains highest in Safety Avg. across judges with moderate computational overhead.
📝 Abstract
Fine-tuning safety-aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our diagnostics show that task updates expose different safety constraints, motivating joint selection of relevant references and compatible task samples. We propose DualSelect, a coupled framework for task and reference selection that refreshes task-conditioned safety references before filtering whole task samples compatible with the induced reference direction. Under a minimax view, DualSelect selects safety references with high preservation loss and task conflict, together with compatible task samples, through entropy-regularized scoring surrogates, lazy reference refresh, and gradient correction. On 1B–8B LLMs, DualSelect preserves safety without losing task utility; using the REDORCA judge, it improves Safety Avg. over the strongest baseline by at least 5.10 points and remains highest in Safety Avg. across judges with moderate overhead. This view extends to retention-focused continual learning.
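The coupled selection described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the scoring formula, the function names (`select`, `softmax`), the parameters `tau` and `lam`, and the toy gradient vectors are all assumptions made for illustration. The one standard fact it relies on is that maximizing a linear score plus an entropy bonus over the probability simplex yields softmax weights. Lazy reference refresh and gradient correction are not modeled here.

```python
import math


def softmax(scores, tau):
    """Weights p maximizing sum_i p_i * s_i + tau * H(p) over the simplex."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select(task_grads, ref_grads, ref_losses, tau=0.5, lam=1.0):
    """Hypothetical coupled selection step (illustrative assumption only).

    1. Score each safety reference by its preservation loss plus its
       conflict with the mean task gradient (negative cosine, clipped at 0).
    2. Turn scores into entropy-regularized (softmax) weights.
    3. Form the induced reference direction as the weighted reference gradient.
    4. Keep task samples whose gradient does not oppose that direction.
    """
    g_task = mean_vector(task_grads)
    scores = [
        loss + lam * max(0.0, -cosine(g, g_task))
        for g, loss in zip(ref_grads, ref_losses)
    ]
    weights = softmax(scores, tau)
    direction = [
        sum(w * g[k] for w, g in zip(weights, ref_grads))
        for k in range(len(g_task))
    ]
    keep = [i for i, g in enumerate(task_grads) if dot(g, direction) >= 0.0]
    return weights, direction, keep


# Toy 2-D "gradients"; real gradients would come from the model.
task_grads = [[1.0, 0.0], [-1.0, 0.2], [0.5, 0.5]]
ref_grads = [[0.0, 1.0], [1.0, 1.0]]
ref_losses = [0.2, 0.9]

weights, direction, keep = select(task_grads, ref_grads, ref_losses)
print(weights, direction, keep)
```

In this toy run the second reference receives the larger weight because its assumed preservation loss is higher, and the second task sample is dropped because its gradient opposes the induced reference direction. A smaller `tau` concentrates weight on the top-scoring references, while a larger `tau` spreads it more evenly.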
Problem

Research questions and friction points this paper is trying to address.

LLM safety
fine-tuning
task-reference selection
safety alignment
continual learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

coupled selection
safety-preserving fine-tuning
reference refreshing
entropy-regularized scoring
minimax optimization