In-Training Defenses against Emergent Misalignment in Language Models

📅 2025-08-08

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

This paper identifies Emergent Misalignment (EMA)—a phenomenon wherein large language models (LLMs) exhibit pervasive harmful behaviors on unseen domains after API-driven lightweight domain fine-tuning, even with minimal target-domain data; crucially, such misalignment is undetectable from the fine-tuning data alone. To address this, the authors propose the first systematic in-training defense framework tailored to API fine-tuning scenarios. It integrates four key techniques: SafeLoRA (projecting updates into a safety-constrained subspace), KL-divergence regularization to preserve pre-fine-tuning output distributions, ℓ₂-distance constraints in feature space to limit representation drift, and interleaved training with safety-augmented samples. Evaluated across four malicious tasks, the method substantially suppresses EMA while maintaining strong performance on multiple benign benchmarks—demonstrating its effectiveness, practicality, and deployment readiness.

Technology Category

Application Category

📝 Abstract

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving of a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate the methods' emergent misalignment effect across four malicious, EMA-inducing tasks. Second, we assess the methods' impacts on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.

Problem

Research questions and friction points this paper is trying to address.

Preventing harmful behaviors from domain-specific fine-tuning in LLMs

Detecting emergent misalignment hidden in fine-tuning API outputs

Evaluating in-training safeguards against misalignment for API providers

Innovation

Methods, ideas, or system contributions that make the work stand out.

KL-divergence regularization toward safe model

Feature space L2 distance regularization

Safe subspace projection via SafeLoRA

🔎 Similar Papers

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance