U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in multimodal learning—including strong inter-modal coupling, training complexity, high memory overhead, and poor robustness to missing modalities—this paper proposes the Unified Unimodal Adaptation (U2A) framework. Methodologically, U2A introduces two key innovations: (1) a novel Mask Tokens mechanism that reconstructs missing modality features using only a single learnable token, enabling cross-modal feature generation; and (2) joint fine-tuning of pretrained unimodal encoders via low-rank adaptation (LoRA), achieving end-to-end multimodal adaptation without auxiliary modality imputation models or multi-stage training. Empirically, U2A achieves state-of-the-art performance under both full- and missing-modality settings, reduces parameter count by over 60%, and significantly lowers computational cost. Comprehensive evaluations across multiple tasks and datasets demonstrate its strong generalization capability and robustness to modality absence.
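The LoRA component of U2A adds small trainable low-rank updates to frozen pretrained encoder weights. The paper's implementation is not shown here, so the following is only a minimal NumPy sketch of the standard LoRA parameterization (the names `r`, `alpha`, `A`, `B` follow common LoRA conventions, not this paper's code):

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (illustrative sketch).

    Effective weight: W + (alpha / r) * B @ A, where only A and B are trained.
    """

    def __init__(self, W: np.ndarray, r: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                     # frozen pretrained weight
        self.A = rng.normal(0, 0.02, size=(r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))                  # trainable up-projection, zero-init
        self.scale = alpha / r                         # update starts as a no-op

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (batch, d_in) -> (batch, d_out); low-rank path adds the adaptation
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.eye(3)                 # stand-in for a pretrained weight
layer = LoRALinear(W, r=2)
x = np.ones((1, 3))
out = layer(x)                # equals x @ W.T since B is zero-initialized
```

Zero-initializing `B` means the adapted model starts out exactly equal to the pretrained one, which is why only the small `A`/`B` matrices (and not the full encoder) need gradients.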

📝 Abstract
Multimodal learning often relies on designing new models and complex training strategies to achieve optimal performance. We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes pretrained unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks. Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies, such as alternating training, gradient modifications, or unimodal fine-tuning. To address missing modalities during both training and testing, we introduce Mask Tokens (MT), which generate missing modality features from available modalities using a single token per modality. This simplifies the process, removing the need for specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that U2A matches or outperforms state-of-the-art methods in both complete and missing modality settings, showcasing strong performance and robustness across various modalities, tasks, and datasets. We also analyze and report the effectiveness of Mask Tokens in different missing modality scenarios. Overall, our method provides a robust, flexible, and efficient solution for multimodal learning, with minimal computational overhead.
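The abstract describes Mask Tokens as a single learnable token per modality that stands in for a missing modality's features. As a rough sketch of that substitution step (the helper name and the concatenation-based fusion are assumptions for illustration, not the paper's actual fusion design):

```python
import numpy as np

def fuse_with_mask_tokens(feats, mask_tokens):
    """Substitute a learnable per-modality token for each missing modality.

    feats: dict modality -> feature vector, or None if the modality is missing
    mask_tokens: dict modality -> learnable token with the same shape
    Returns a concatenated fused representation (hypothetical fusion scheme).
    """
    parts = []
    for modality, f in feats.items():
        # Use the real features when present, else this modality's mask token
        parts.append(f if f is not None else mask_tokens[modality])
    return np.concatenate(parts)

# Learnable tokens (zero-initialized here purely for the example)
mask_tokens = {"audio": np.zeros(4), "video": np.zeros(4)}
feats = {"audio": np.ones(4), "video": None}   # video modality missing
fused = fuse_with_mask_tokens(feats, mask_tokens)  # shape (8,)
```

In training, the mask tokens would receive gradients like any other parameter, letting one token per modality learn to approximate the missing modality's contribution without a separate imputation network.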
Problem

Research questions and friction points this paper is trying to address.

Multimodal Learning
Efficiency
Intelligence Flexibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

U2A Method
Multi-modal Learning
Masked Token Mechanism
Authors

Md Kaykobad Reza, University of California Riverside
Niki Nezakati, University of California Riverside
Ameya Patil, Senior Applied Scientist, Amazon Lab126 (on-device machine learning, multimodal machine learning, sensor fusion)
Mashhour Solh, Amazon (Generative AI, Agentic AI, Computer Vision, Multimodal Fusion, Computational Imaging)
M. S. Asif, University of California Riverside