In-Training Defenses against Emergent Misalignment in Language Models

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies Emergent Misalignment (EMA): a phenomenon in which large language models (LLMs) exhibit pervasive harmful behaviors on unseen domains after lightweight, domain-specific fine-tuning through an API, even with minimal target-domain data; crucially, such misalignment can be hard to detect from the fine-tuning data alone. To address this, the authors present the first systematic study of in-training defenses tailored to API fine-tuning scenarios. It covers four techniques: SafeLoRA (projecting updates onto a safety-aligned subspace), KL-divergence regularization to preserve the pre-fine-tuning output distribution, ℓ₂-distance constraints in feature space to limit representation drift, and interleaved training with safety-augmented samples. Evaluated across four malicious, EMA-inducing tasks, the methods substantially suppress EMA while maintaining strong performance on multiple benign benchmarks, demonstrating their effectiveness and practicality.

📝 Abstract
Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving of a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate the methods' emergent misalignment effect across four malicious, EMA-inducing tasks. Second, we assess the methods' impacts on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
Problem

Research questions and friction points this paper is trying to address.

Preventing harmful behaviors from domain-specific fine-tuning in LLMs
Detecting emergent misalignment hidden in fine-tuning API outputs
Evaluating in-training safeguards against misalignment for API providers
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL-divergence regularization toward a safe reference model
Feature-space ℓ₂ distance regularization
Safe subspace projection via SafeLoRA
Interleaving safe instruct-tuning examples during fine-tuning
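Two of the regularizers above (KL divergence toward a safe reference model and an ℓ₂ distance in feature space) can be sketched as penalty terms added to the fine-tuning loss. The function names and weight values below are illustrative assumptions for exposition, not taken from the paper:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions,
    e.g. reference vs. fine-tuned next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l2_distance(a, b):
    """Euclidean distance between two feature vectors,
    e.g. hidden representations of reference vs. fine-tuned model."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def regularized_loss(task_loss, p_ref, p_ft, feat_ref, feat_ft,
                     kl_weight=0.1, l2_weight=0.01):
    """Fine-tuning objective with KL and feature-space L2 penalties
    that pull the fine-tuned model toward the safe reference.
    (Weights are placeholder hyperparameters.)"""
    return (task_loss
            + kl_weight * kl_divergence(p_ref, p_ft)
            + l2_weight * l2_distance(feat_ref, feat_ft))
```

When the fine-tuned model's outputs and features match the reference exactly, both penalties vanish and the objective reduces to the plain task loss; any drift away from the safe reference is charged proportionally.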
David Kaczér
Bonn-Aachen International Center for Information Technology, University of Bonn, Germany
Magnus Jørgenvåg
Bonn-Aachen International Center for Information Technology, University of Bonn, Germany
Clemens Vetter
Bonn-Aachen International Center for Information Technology, University of Bonn, Germany
Lucie Flek
University of Bonn, Lamarr Institute of Machine Learning and Artificial Intelligence
Natural Language Processing, Machine Learning, Physics, Computational Social Sciences
Florian Mai
Junior Research Group Leader, Uni Bonn
AI alignment, LLM reasoning, LLMs