AI Summary
Existing expert merging methods rely on Euclidean geometry assumptions, failing to accommodate the non-Euclidean curvature of parameter manifolds and thus limiting pretraining generalization; curvature-aware alternatives often require Fisher information matrix (FIM) approximations, incurring substantial memory overhead. This paper proposes CAMEx, a lightweight, curvature-aware expert merging protocol that avoids explicit FIM estimation. CAMEx models the parameter manifold geometry via natural gradients and introduces a dynamic weighted fusion mechanism, achieving improved geometric alignment while maintaining near-linear computational and memory complexity. Theoretical analysis and extensive multi-task experiments demonstrate that CAMEx significantly outperforms Euclidean baselines, yields more stable optimization trajectories, enhances generalization, and natively supports efficient scaling to large language models.
Abstract
Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach enhances both pre-training and fine-tuning, resulting in better optimization trajectories and improved generalization without the substantial memory overhead typically associated with curvature-aware methods. Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method.
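To make the contrast between Euclidean and curvature-aware merging concrete, below is a minimal sketch of the idea. A plain Euclidean merge averages expert task vectors directly, while a natural-gradient-style merge preconditions each task vector by an approximate curvature term before combining. The function name, the diagonal curvature approximation, and all parameters here are illustrative assumptions, not the paper's exact CAMEx protocol.

```python
import numpy as np

def curvature_aware_merge(base, experts, alphas, curvature_diag, eps=1e-8):
    """Illustrative curvature-aware merge of expert weights.

    base:           flattened base/shared parameters, shape (d,)
    experts:        list of flattened expert parameters, each shape (d,)
    alphas:         per-expert merging coefficients
    curvature_diag: positive diagonal curvature proxy, shape (d,)
                    (a diagonal stand-in for Fisher-style curvature)

    With curvature_diag == 1, this reduces to the standard Euclidean
    merge of task vectors (expert - base).
    """
    merged = base.copy()
    for alpha, expert in zip(alphas, experts):
        task_vector = expert - base
        # Natural-gradient-style step: rescale each coordinate by the
        # inverse of the (diagonal) curvature estimate.
        merged += alpha * task_vector / (curvature_diag + eps)
    return merged

# Tiny usage example with synthetic expert weights.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
experts = [base + rng.normal(scale=0.1, size=8) for _ in range(3)]
alphas = [1.0 / 3] * 3
curvature = np.abs(rng.normal(size=8)) + 1.0  # positive curvature proxy
merged = curvature_aware_merge(base, experts, alphas, curvature)
print(merged.shape)
```

Setting `curvature_diag` to all ones recovers the Euclidean baseline, which makes the role of the curvature term easy to isolate in experiments.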