🤖 AI Summary
This work addresses the challenge of efficiently integrating independently pretrained unimodal large models—such as vision-language and audio-language systems—without large-scale multimodal paired data or additional training. The authors propose SSAM, a novel framework that, to their knowledge, achieves the first training-free fusion of heterogeneous multimodal large models. Leveraging low-rank parameter decomposition, SSAM identifies language-relevant subspaces within each modality-specific expert and aligns their singular subspaces, enabling parameter sharing while keeping representations disentangled. This approach preserves complementary modality-specific knowledge, substantially mitigates parameter interference, and supports arbitrary combinations of input modalities. Evaluated on four benchmarks, SSAM outperforms existing training-free fusion methods and even surpasses certain jointly trained models.
📝 Abstract
Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities typically requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask: can we merge them into a single MLLM that handles multiple modalities? Merging MLLMs with different input modalities remains challenging, partly because of differences in their learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM keeps modality-specific parameter updates separate, identifies a shared low-rank subspace for language-related parameter updates, aligns the updates within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space offers a scalable and resource-efficient alternative to conventional joint multimodal training.
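The core idea, merging parameter updates inside a shared low-rank singular subspace rather than in raw parameter space, can be sketched in NumPy. This is an illustrative sketch only, not the authors' exact algorithm: the function name `align_and_merge`, the fixed rank, the equal-weight average, and the choice of one expert's singular subspace as the shared reference are all assumptions made here for clarity.

```python
import numpy as np

def low_rank_subspace(delta, rank):
    """Top-`rank` singular subspace of a parameter update (task vector)."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :rank], Vt[:rank]

def align_and_merge(w_base, w_a, w_b, rank=8):
    """Hypothetical merge of two experts fine-tuned from the same base model.

    Each expert's update is projected onto a shared low-rank singular
    subspace before averaging, so the merge happens where the updates
    are aligned instead of in the full, possibly interfering, space.
    """
    d_a, d_b = w_a - w_base, w_b - w_base
    # Assumption: use expert A's singular subspace as the shared reference.
    Ua, Vta = low_rank_subspace(d_a, rank)
    # Project an update into that subspace from both sides.
    project = lambda d: Ua @ (Ua.T @ d @ Vta.T) @ Vta
    merged_delta = 0.5 * (project(d_a) + project(d_b))
    return w_base + merged_delta
```

In this toy form, merging an expert with itself at full rank recovers that expert exactly, since the projection becomes the identity; the interesting regime is a small `rank`, where only the dominant, language-related directions of each update survive the merge.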