🤖 AI Summary
This work investigates how fine-tuning large language models—even with benign data—can inadvertently degrade alignment and adversarial robustness, a setting in which the influence of the fine-tuning objective itself has remained unclear. Under controlled conditions fixing data, domain, architecture, and optimization settings, the study systematically compares six fine-tuning objectives: supervised fine-tuning (SFT), direct preference optimization (DPO), conditional fine-tuning, inoculation prompting, odds ratio preference optimization (ORPO), and KL regularization. Results show that while all methods perform similarly at small training scales, ORPO and KL regularization substantially enhance adversarial robustness and mitigate persona drift at larger scales, highlighting the critical role of constrained optimization objectives in balancing safety and capability.
📝 Abstract
Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of how fine-tuning objectives shape these safety outcomes remains limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.
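As an illustration of what "constraining the learning signal" means, the two objectives the abstract singles out can be sketched in their standard forms (notation is ours, not necessarily the paper's: $\pi_\theta$ is the fine-tuned model, $\pi_{\text{ref}}$ a frozen reference model, $y_w$/$y_l$ chosen/rejected responses, and $\beta$, $\lambda$ weighting hyperparameters):

```latex
% KL-regularized fine-tuning: the usual SFT cross-entropy loss plus a
% penalty that keeps the tuned policy close to the reference model.
\mathcal{L}_{\mathrm{KL}}(\theta)
  = \mathcal{L}_{\mathrm{SFT}}(\theta)
  + \beta \, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)

% ORPO: the SFT loss plus an odds-ratio term that contrasts chosen
% (y_w) and rejected (y_l) responses without a separate reference model.
\mathcal{L}_{\mathrm{ORPO}}(\theta)
  = \mathcal{L}_{\mathrm{SFT}}(\theta)
  - \lambda \, \log \sigma\!\left(
      \log \frac{\mathrm{odds}_\theta(y_w \mid x)}
                {\mathrm{odds}_\theta(y_l \mid x)}
    \right)
```

In both cases the capability-seeking term $\mathcal{L}_{\mathrm{SFT}}$ is paired with a second term that restrains how far the update can pull the model, which is the structural property the abstract credits for the improved robustness at large training scales.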