🤖 AI Summary
Large pretrained models suffer from concept entanglement in task vector editing, which makes behavioral modulation hard to control. To address this, we propose an interpretable task vector decomposition framework that, for the first time, disentangles a task vector into two orthogonal components: a shared-subspace component encoding general-purpose knowledge and a task-specific component capturing concepts exclusive to that task. The decomposition is achieved via parameter-space projection and invariant subspace identification, with an orthogonality constraint enforcing a clean separation. Our method applies broadly across modalities, demonstrated on image classification, diffusion models, and large language models (LLMs): it improves multi-task fusion accuracy by 5% in vision tasks, preserves generation quality during style mixing in diffusion models, and reduces toxicity by 47% in LLMs without degrading general capabilities. This work addresses a fundamental limitation of conventional task vector arithmetic: its lack of concept-level controllability.
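The conventional task vector arithmetic the summary critiques can be sketched in a few lines. This is a minimal illustration, not the paper's code; all variable names are assumptions, and parameters are treated as flattened vectors:

```python
import numpy as np

# Minimal sketch of conventional task-vector arithmetic (the baseline
# whose concept entanglement the proposed decomposition addresses).
# theta_* are flattened parameter vectors; names are illustrative.
rng = np.random.default_rng(0)
theta_pre = rng.normal(size=8)                 # pre-trained weights
theta_task_a = theta_pre + rng.normal(size=8)  # fine-tuned on task A
theta_task_b = theta_pre + rng.normal(size=8)  # fine-tuned on task B

tau_a = theta_task_a - theta_pre               # task vector for A
tau_b = theta_task_b - theta_pre               # task vector for B

theta_multi = theta_pre + tau_a + tau_b        # addition: fuse behaviors
theta_forget = theta_pre - tau_a               # negation: remove a behavior
```

Because `tau_a` and `tau_b` can share overlapping concept directions, adding or negating one also perturbs the other's behavior; this is the uncontrolled modulation the decomposition framework is designed to prevent.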
📝 Abstract
Large pre-trained models have transformed machine learning, yet adapting them to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors, and have given rise to large repositories of task vectors tailored to specific behaviors. Arithmetic operations on these task vectors allow desired behaviors to be combined seamlessly, without the need for large datasets. However, task vectors often contain overlapping concepts that interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing knowledge shared across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition across three domains: improving multi-task merging in image classification by 5% by using the shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and reducing toxicity in language models by 47% while preserving performance on general knowledge tasks by negating the toxic information isolated in the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations of model editing operations.
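The shared/unique split described above can be illustrated with a small NumPy sketch. This is a hedged approximation, not the paper's algorithm: it assumes the shared subspace is estimated from the top right-singular directions of the stacked task vectors, and the function names, rank choice, and editing example are all illustrative:

```python
import numpy as np

def task_vector(finetuned, pretrained):
    """A task vector is the parameter difference: tau = theta_ft - theta_pre."""
    return finetuned - pretrained

def decompose(task_vectors, shared_rank=1):
    """Split each task vector into shared + unique orthogonal components.

    Illustrative assumption: the shared subspace is spanned by the top
    right-singular vectors of the stacked task vectors (one plausible
    stand-in for the paper's invariant subspace identification).
    """
    T = np.stack(task_vectors)                  # (num_tasks, num_params)
    _, _, Vt = np.linalg.svd(T, full_matrices=False)
    basis = Vt[:shared_rank]                    # orthonormal rows
    shared = [(v @ basis.T) @ basis for v in task_vectors]  # projection
    unique = [v - s for v, s in zip(task_vectors, shared)]  # orthogonal residual
    return shared, unique

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=64)
taus = [task_vector(theta_pre + rng.normal(size=64), theta_pre)
        for _ in range(3)]
shared, unique = decompose(taus)

# Orthogonality: each unique component is orthogonal to its shared part.
print(abs(np.dot(shared[0], unique[0])) < 1e-8)  # True

# Editing example: negate only the unique component of task 0,
# removing its exclusive behavior while keeping the shared knowledge.
edited = theta_pre + shared[0] - unique[0]
```

Because the unique component lies orthogonal to the shared subspace, negating it (as in the toxicity-reduction experiment) leaves the shared, general-knowledge directions untouched by construction.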