🤖 AI Summary
To address the misalignment of weight spaces between speech and music encoders in cross-modal audio model fusion—which makes direct parameter merging infeasible—this paper proposes a correlation-driven learnable permutation alignment method, the first to extend the Git Re-Basin paradigm to cross-modal audio encoder fusion. The approach constructs layer-wise permutation matrices by maximizing inter-layer feature cross-correlation, enabling fine-grained alignment of Transformer layers; parameter merging is then performed linearly under the aligned topology. Compared to baselines such as linear interpolation, the fused model achieves an average +14.83-point improvement on music understanding benchmarks while fully preserving the original automatic speech recognition performance. The method incurs negligible computational overhead, establishing a scalable, lightweight fusion paradigm for multi-task audio modeling.
📝 Abstract
Creating a unified speech and music model requires expensive pre-training. Model merging can instead create a unified audio model with minimal computational expense. However, direct merging is challenging when the models are not aligned in the weight space. Motivated by Git Re-Basin, we introduce a correlation-permutation approach that aligns a music encoder's internal layers with those of a speech encoder, extending previous work to the merging of Transformer layers. The method computes, layer by layer, a permutation matrix that maximizes the feature-wise cross-correlation between the two models, enabling effective fusion of these otherwise disjoint models. The merged model retains speech capabilities while significantly enhancing music performance, achieving an improvement of 14.83 points in average score over linear-interpolation model merging. This work allows the creation of unified audio models from independently trained encoders.
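The core idea—matching units of two layers by cross-correlation and merging under the resulting permutation—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, activations are assumed to be collected as `(samples, units)` matrices from matching layers, and the assignment is solved with SciPy's Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def correlation_permutation(feats_a, feats_b):
    """Find a permutation of model B's units that maximizes per-unit
    cross-correlation with model A's units at the same layer.
    feats_a, feats_b: (num_samples, num_units) activation matrices."""
    a = (feats_a - feats_a.mean(0)) / (feats_a.std(0) + 1e-8)
    b = (feats_b - feats_b.mean(0)) / (feats_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)                  # (units, units) cross-correlation
    _, perm = linear_sum_assignment(-corr)   # negate to maximize total correlation
    return perm                              # perm[i]: B's unit aligned to A's unit i

def merge_weights(w_a, w_b, perm, alpha=0.5):
    """Linearly interpolate weight rows after permuting B's output units."""
    return alpha * w_a + (1 - alpha) * w_b[perm]
```

In a full pipeline this would be applied layer by layer, with the permutation also propagated to the input dimension of the following layer so the network's function is preserved before interpolation.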