Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe negative transfer commonly observed in unified multi-task training for audio-visual tasks, where task heterogeneity causes performance on nearly 55% of tasks to fall below single-task baselines. To mitigate this, the authors propose an explicit collaboration mechanism that jointly aligns heterogeneous tasks and models inter-task relationships at both data and model levels. Key innovations include the construction of the AV-UIE v2 dataset featuring explicit reasoning processes, the design of an Interaction-aware LoRA (I-LoRA) dynamic routing mechanism, and the integration of unified instruction tuning with task-granularity alignment interfaces. This approach achieves positive transfer for the first time in multi-task audio-visual understanding, outperforming single-task baselines on 88% of tasks and significantly surpassing existing unified and task-specific models.
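To make the data-side idea concrete, here is a minimal, purely hypothetical sketch of what one AV-UIE v2-style training sample with an explicit reasoning trace might look like. All field names, paths, and values below are illustrative assumptions, not taken from the released dataset.

```python
# Hypothetical shape for one AV-UIE-v2-style instruction-tuning sample.
# The key idea from the paper: each sample carries an explicit reasoning
# process in addition to the instruction and final answer.
sample = {
    "task": "audio_visual_question_answering",   # one of the 7 unified tasks
    "video": "clips/00042.mp4",                  # illustrative path
    "audio": "clips/00042.wav",
    "instruction": "What is making the sound in the scene?",
    # Explicit reasoning steps the model is tuned to produce
    # before committing to its final answer.
    "reasoning": [
        "The audio contains a sustained engine hum.",
        "The frames show a motorcycle entering from the left.",
    ],
    "answer": "A motorcycle driving past the camera.",
}
```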

📝 Abstract
Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning equips pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which lead to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning processes. It contains approximately 222K samples spanning 17 datasets and 7 tasks, enabling the model to capture cross-task relationships at different levels of granularity. On the model side, we design a unified interface to align heterogeneous task formulations, and propose Interaction-aware LoRA (I-LoRA), which explicitly models inter-task relationships via dynamic routing to coordinate distinct audio-visual interaction patterns, mitigating parameter interference. Extensive experiments show Crab$^{+}$ covers a broader range of tasks than existing unified models while outperforming specialized models on various benchmarks. We successfully reverse the negative transfer trend, achieving positive transfer where multi-task learning surpasses single-task baselines on nearly 88% of tasks. These results hold across diverse AV-LLM paradigms and are validated through in-depth visualization, positioning Crab$^{+}$ as a robust step towards holistic audio-visual scene understanding.
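The abstract describes I-LoRA only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: several low-rank adapters, one per audio-visual interaction pattern, mixed per token by a learned router on top of a frozen base projection. The class and parameter names (InteractionAwareLoRA, num_experts, rank) are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionAwareLoRA(nn.Module):
    """Sketch of an I-LoRA-style layer: multiple low-rank adapters
    ("interaction experts") share one frozen base projection, and a
    lightweight router mixes them per token. Expert count, rank, and
    initialization are illustrative choices."""

    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                       # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.scaling = alpha / rank
        # One low-rank (A, B) pair per assumed interaction pattern.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        # Router producing per-token mixing weights over the experts.
        self.router = nn.Linear(d_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        gates = F.softmax(self.router(x), dim=-1)                 # (b, s, E)
        # Low-rank update of every expert: x @ A_e @ B_e.
        delta = torch.einsum("bsd,edr,ero->bseo", x, self.A, self.B)
        # Router-weighted mixture of the expert updates.
        mixed = torch.einsum("bse,bseo->bso", gates, delta)
        return self.base(x) + self.scaling * mixed
```

A layer like this would replace selected attention or MLP projections in the backbone; only the adapters and router are trained, so in principle tasks with conflicting gradients can be routed to different experts, which matches the paper's stated goal of mitigating parameter interference.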
Problem

Research questions and friction points this paper is trying to address.

negative transfer
audio-visual scene understanding
task heterogeneity
multimodal intelligence
multi-task learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Large Language Model
Negative Transfer Mitigation
Explicit Task Cooperation
Interaction-aware LoRA
Unified Instruction Tuning
Dongnuan Cai
Gaoling School of Artificial Intelligence, Renmin University of China
Henghui Du
Gaoling School of Artificial Intelligence, Renmin University of China
Chang Zhou
AI Technology Center, Online Video Business Unit, Tencent PCG
Xi Chen
Tencent Inc.
Natural Language Processing, Knowledge Graph, Machine Learning
Dan Guo
IEEE Senior Member; Professor, Hefei University of Technology
Multimedia Computing, Artificial Intelligence
Hongyuan Zhang
The University of Hong Kong
Representation Learning, Multimodal Learning, Graph Neural Networks, Optimization
Xuelong Li
Institute of Artificial Intelligence of China Telecom (TeleAI)
Di Hu
Tenure-track Associate Professor, Renmin University of China
Multimodal Perception, Multimodal Learning, Multimodal Interaction