Harnessing Optimization Dynamics for Curvature-Informed Model Merging

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of merging multiple specialized models—each trained for distinct capabilities (e.g., mathematical reasoning, code generation, instruction following, knowledge recall)—during supervised fine-tuning (SFT) without retraining. We propose two complementary methods: (1) OTA Merging, which leverages second-moment statistics from optimization trajectories as curvature proxies to enable curvature-aware weighted ensemble merging; and (2) Fast Fisher Grafting, which constructs low-rank task-specific masks using diagonal Fisher information for localized sparse parameter editing, augmented by memory-efficient second-moment compression. We identify curvature overlap as a fundamental mechanistic prerequisite for effective model merging. Experiments demonstrate that our approach significantly outperforms strong baselines across diverse multi-capability combinations, effectively mitigates negative transfer, improves merged model quality, and maintains robustness across varying sparsity levels. Code and models are publicly released.
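The core OTA idea described above — using optimizer second-moment statistics as a diagonal curvature proxy to reweight each checkpoint's parameter edits — can be illustrated with a minimal NumPy sketch. The function name and the per-parameter weighting scheme here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def ota_merge(base, checkpoints, second_moments, eps=1e-8):
    """Curvature-aware weighted merge (illustrative sketch).

    base:           dict of parameter name -> base-model array
    checkpoints:    list of dicts, one fine-tuned checkpoint each
    second_moments: list of dicts with optimizer second-moment estimates
                    (e.g. Adam's exp_avg_sq), used as a diagonal curvature proxy
    """
    merged = {}
    for name, theta0 in base.items():
        # Each task's edit is its delta from the shared base model.
        deltas = [ckpt[name] - theta0 for ckpt in checkpoints]
        # Higher curvature (larger second moment) => parameter mattered more
        # for that task, so its edit gets more weight in the aggregation.
        weights = [v[name] + eps for v in second_moments]
        total = sum(weights)
        merged[name] = theta0 + sum(w * d for w, d in zip(weights, deltas)) / total
    return merged
```

With two toy checkpoints, the edit from the checkpoint whose second moment is three times larger dominates the merged delta, which is the interference-mitigation behavior the summary describes.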

📝 Abstract
Model merging is an effective post-training strategy for composing capabilities in large language models without joint retraining. We study this in the supervised fine-tuning (SFT) stage, where multiple capability-based SFT checkpoints -- spanning math, code, precise instruction following, general instruction following, and knowledge recall -- must be consolidated into a single model. We introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware aggregation that leverages optimizer second-moment statistics as a diagonal curvature proxy to reweight parameter edits and mitigate interference. Complementing OTA, we propose Fast Fisher Grafting (FFG), a curvature-driven task-localization step that sparsifies conflicting or low-importance edits. FFG induces extremely low-rank masks concentrated in early attention query/key projections and token embeddings, exploiting shared curvature across capabilities. We further develop a memory-light compression of the second moments that preserves OTA's effect. Across diverse capability-based SFT checkpoints, OTA+FFG improves merged-model quality over strong weight-space baselines, reduces negative transfer, and remains robust across sparsity levels. Analyses reveal substantial curvature overlap between checkpoints, offering a novel lens on why simple linear merging can be effective in practice. Ablations confirm that FFG is critical for reducing task interference and that the compressed second moments retain the gains of the full formulation. To facilitate reproducibility, we open-source all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints at https://github.com/pmahdavi/ota-merge.
Problem

Research questions and friction points this paper is trying to address.

Merging multiple fine-tuned language models without joint retraining
Reducing interference between different capability-specific model parameters
Developing curvature-aware methods to improve merged model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimization Trajectory Aware (OTA) Merging: curvature-aware aggregation that reweights parameter edits using optimizer second-moment statistics
Fast Fisher Grafting (FFG): sparsifies conflicting or low-importance edits via diagonal-Fisher task-localization masks
Memory-light compression of the second moments that preserves OTA's gains
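The FFG contribution can be sketched as a diagonal-Fisher-scored sparsification of a task's parameter edits: rank each edit by importance, keep only the top fraction, and revert the rest to the base weights. This is a minimal sketch; the saliency score (Fisher times squared delta) and the function name are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def fisher_graft(base, task, fisher_diag, keep_frac=0.1):
    """Keep only the top keep_frac of parameter edits ranked by a
    diagonal-Fisher importance score; revert the rest to base weights.

    base:        base-model parameter array
    task:        fine-tuned checkpoint's parameter array
    fisher_diag: diagonal Fisher information estimate, same shape
    """
    delta = task - base
    # Saliency of each edit: curvature times squared edit magnitude
    # (an assumed score; a common importance heuristic).
    score = fisher_diag * delta**2
    k = max(1, int(keep_frac * score.size))
    # Threshold at the k-th largest score to build a sparse keep-mask.
    thresh = np.partition(score.ravel(), -k)[-k]
    mask = score >= thresh
    return base + mask * delta, mask
```

At 10–20% sparsity this keeps the few high-curvature edits that localize the task while discarding edits most likely to interfere with other capabilities.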
Pouria Mahdavinia
The Pennsylvania State University
Hamed Mahdavi
The Pennsylvania State University
Niloofar Mireshghallah
Carnegie Mellon University
Mehrdad Mahdavi
Hartz Family Associate Professor of Computer Science @ Penn State
Machine Learning · Optimization Theory · Learning Theory