CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the data scarcity, distribution skew, and prohibitive retraining costs that heterogeneous interaction data (multi-source origins, limited samples, missing modalities) imposes on cross-learning, this paper proposes CM3T, a lightweight, plug-and-play framework for multimodal video classification that integrates with any transformer backbone. CM3T introduces a dual-adapter architecture: multi-head vision adapters enable efficient visual transfer learning, while cross-attention adapters fuse additional modalities. Only 12.8% (video-only) to 22.3% (two additional modalities) of the backbone's parameter count is trained; the backbone itself stays frozen, eliminating full retraining and yielding roughly a 7.8× reduction in trainable parameters. Evaluated on Epic-Kitchens-100, MPIIGroupInteraction, and UDIVA v0.5, CM3T matches or surpasses state-of-the-art methods while substantially reducing computational and data requirements.

📝 Abstract
Challenges in cross-learning include an inhomogeneous or even inadequate amount of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in NLP, namely adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that adapts transformer-based models to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient, as the backbone and other plugins do not need to be finetuned alongside these additions. Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction, and UDIVA v0.5) show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters relative to the backbone for video input, and only 22.3% for two additional modalities, we achieve results comparable to, and even better than, the state of the art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.
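The paper itself ships no code here; the following is a minimal, dependency-free sketch of the two adapter patterns the abstract names. All class names, dimensions, and the single-head simplification are assumptions of this illustration: the paper's vision adapter is multi-head and sits inside a frozen transformer backbone, whereas this sketch only shows the shape-preserving, residual structure that makes such plugins trainable without touching backbone weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, rng, scale=0.02):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

class BottleneckAdapter:
    """Down-project -> nonlinearity -> up-project -> residual add:
    the standard adapter pattern that multi-head vision adapters extend."""
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rand_matrix(dim, bottleneck, rng)
        self.w_up = rand_matrix(bottleneck, dim, rng)

    def __call__(self, x):  # x: tokens x dim
        h = [[max(v, 0.0) for v in row] for row in matmul(x, self.w_down)]  # ReLU
        return add(x, matmul(h, self.w_up))

    def n_params(self):
        return 2 * len(self.w_down) * len(self.w_down[0])

class CrossAttentionAdapter:
    """Single-head cross-attention with a residual connection: tokens from
    the frozen video backbone (queries) attend to another modality's tokens."""
    def __init__(self, dim, rng):
        self.w_q = rand_matrix(dim, dim, rng)
        self.w_k = rand_matrix(dim, dim, rng)
        self.w_v = rand_matrix(dim, dim, rng)

    def __call__(self, video_tokens, modality_tokens):
        q = matmul(video_tokens, self.w_q)
        k = matmul(modality_tokens, self.w_k)
        v = matmul(modality_tokens, self.w_v)
        scale = 1.0 / math.sqrt(len(self.w_k[0]))
        # scores[i][j]: how much video token i attends to modality token j
        scores = [softmax([sum(x * y for x, y in zip(qr, kr)) * scale for kr in k])
                  for qr in q]
        return add(video_tokens, matmul(scores, v))

rng = random.Random(0)
adapter = BottleneckAdapter(16, 4, rng)       # 2*16*4 = 128 trainable weights
xattn = CrossAttentionAdapter(16, rng)

video = rand_matrix(8, 16, rng, scale=1.0)    # 8 video tokens from the backbone
audio = rand_matrix(5, 16, rng, scale=1.0)    # 5 tokens from a second modality

out = xattn(adapter(video), audio)
print(len(out), len(out[0]))  # → 8 16 (token count and width preserved)
```

Because both blocks are residual and shape-preserving, they can be inserted after a frozen backbone layer and trained alone, which is what keeps the trainable-parameter fraction small relative to the backbone.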
Problem

Research questions and friction points this paper is trying to address.

Interdisciplinary Learning
Data Scarcity
Resource Consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

CM3T
Multi-modal Learning
Efficient Adaptation