Taming Modality Entanglement in Continual Audio-Visual Segmentation

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal continual learning approaches focus on coarse-grained tasks and struggle with fine-grained scenarios involving modality entanglement—particularly in audio-guided continual segmentation, where semantic drift (e.g., sounding objects misclassified as background) and co-occurrence confusion (mutual misclassification of frequently co-occurring classes) arise. This work introduces, for the first time, the **continual audio-visual segmentation task**, establishing a novel multimodal continual learning paradigm tailored to fine-grained class-incremental settings. We propose a **multimodal sample selection strategy** and a **collision-detection-based sample replay mechanism**, jointly optimizing audio-visual stream representations to mitigate modality entanglement. Evaluated under three audio-visual incremental learning protocols, our method significantly outperforms unimodal baselines across all metrics, demonstrating both effectiveness and robustness in preserving discriminative cross-modal semantics during continual learning.

📝 Abstract
Recently, significant progress has been made in multi-modal continual learning, which aims to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks and have limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, which aims to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, which increases the rehearsal frequency of confusable classes during training. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.
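The abstract describes MSS only at a high level (select samples with high modal consistency for rehearsal). As a rough illustration, not the paper's implementation, modal consistency could be scored as cosine similarity between a sample's audio and visual embeddings, keeping the most consistent samples for the rehearsal buffer; the function names and scoring rule below are assumptions.

```python
import numpy as np

def cosine_sim(a, v):
    # Cosine similarity between an audio embedding and a visual embedding.
    return float(np.dot(a, v) / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-8))

def select_rehearsal_samples(audio_embs, visual_embs, k):
    """Illustrative sample selection by audio-visual consistency.

    audio_embs, visual_embs: index-aligned lists of 1-D np.ndarray, one
    pair per sample. Returns indices of the k most consistent samples,
    i.e., those least affected by modality disagreement.
    """
    scores = [cosine_sim(a, v) for a, v in zip(audio_embs, visual_embs)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In this reading, samples whose audio and visual representations disagree (e.g., a sounding object drifting toward the background label) score low and are kept out of the replay memory.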
Problem

Research questions and friction points this paper is trying to address.

Addresses modality entanglement in fine-grained continual learning
Solves multi-modal semantic drift in sequential audio-visual tasks
Mitigates co-occurrence confusion between frequently co-occurring classes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collision-based Multi-modal Rehearsal framework addresses modality entanglement
Multi-modal Sample Selection strategy ensures modal consistency
Collision-based Sample Rehearsal increases frequency of confusable classes
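The CSR mechanism is likewise described only in words. One plausible realization of "increase rehearsal frequency of confusable classes" is to count prediction collisions between class pairs during training and weight rehearsal sampling by those counts; the class and API below are hypothetical sketches under that assumption, not the paper's code.

```python
import random
from collections import defaultdict

class CollisionRehearsal:
    """Sketch of collision-driven rehearsal sampling (hypothetical API).

    A 'collision' is recorded when the model predicts class `pred` for a
    region whose ground truth is `gt` (with gt != pred). Classes involved
    in more collisions have their stored samples replayed more often.
    """

    def __init__(self, memory):
        # memory: dict mapping class id -> list of stored rehearsal samples
        self.memory = memory
        self.collisions = defaultdict(int)  # class id -> collision count

    def record_collision(self, gt, pred):
        if gt != pred:
            self.collisions[gt] += 1
            self.collisions[pred] += 1

    def sample(self, n, rng=random):
        # Sampling weight 1 + collision count: confusable classes are
        # replayed more frequently, but every class is still reachable.
        classes = list(self.memory)
        weights = [1 + self.collisions[c] for c in classes]
        chosen = rng.choices(classes, weights=weights, k=n)
        return [rng.choice(self.memory[c]) for c in chosen]
```

Under this sketch, two frequently co-occurring classes that keep being mistaken for each other accumulate collision counts and dominate the replay stream until the confusion subsides.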
Yuyang Hong
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Qi Yang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Tao Zhang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Zili Wang
StepFun LLM Researcher & M-A-P
Large Language Models, Code Intelligence
Zhaojin Fu
School of Intelligent Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
Kun Ding
CASIA
CV, Multimodal
Bin Fan
School of Intelligent Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
Shiming Xiang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Distance Metric Learning, Semi-supervised Learning, Manifold Learning, Regression, Feature Selection