LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing

📅 2024-12-30
🤖 AI Summary
Weakly supervised audio-visual video parsing suffers from interaction noise when the audio and visual modalities are temporally misaligned. To address this, the paper proposes LINK, a framework that (1) dynamically adjusts the contribution of misaligned modalities via a learnable weighting mechanism, and (2) suppresses cross-modal noise using pseudo-label-guided semantic priors. LINK is described as the first method to achieve interpretable and adaptive balancing of modality interactions. It combines weakly supervised learning, temporal event localization, and cross-modal interaction modeling to jointly localize and classify visible, audible, and bimodal events. Evaluated on the LLP benchmark, LINK outperforms existing weakly supervised approaches in both event classification and boundary localization.
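The "learnable weighting mechanism" mentioned in the summary can be pictured as a gate that scales how much of the other modality is mixed into each segment. The following is a minimal sketch of that general idea only, not the paper's architecture: the function `gated_fusion`, the gate parameters `w_gate` and `b_gate`, and the additive fusion rule are all hypothetical and chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(own, other, w_gate, b_gate):
    """Blend one modality's features with the other modality's features.

    own, other: lists of T feature vectors (each a list of D floats),
                one vector per temporal segment.
    w_gate:     list of 2 * D floats, hypothetical learnable gate weights.
    b_gate:     float, hypothetical learnable gate bias.

    Returns (fused, gates): fused is a list of T vectors of length D,
    gates is a list of T scalars in (0, 1).
    """
    fused, gates = [], []
    for own_vec, other_vec in zip(own, other):
        # Score the concatenated pair; a low score means the other
        # modality is judged unhelpful (e.g. misaligned) for this segment.
        score = sum(w * v for w, v in zip(w_gate, own_vec + other_vec)) + b_gate
        gate = sigmoid(score)
        gates.append(gate)
        # Assumed fusion rule: keep own features, add a gated share of the other.
        fused.append([a + gate * b for a, b in zip(own_vec, other_vec)])
    return fused, gates
```

In this sketch a gate near 0 leaves a segment's own features almost untouched, while a gate near 1 mixes in the other modality fully. In a trained model the gate parameters would be learned from data rather than set by hand.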

📝 Abstract
Audio-visual video parsing focuses on classifying videos through weak labels while identifying events as either visible, audible, or both, alongside their respective temporal boundaries. Many methods ignore that different modalities often lack alignment, thereby introducing extra noise during modal interaction. In this work, we introduce a Learning Interaction method for Non-aligned Knowledge (LINK), designed to equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction. Additionally, we leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities. Our experimental findings demonstrate that our model outperforms existing methods on the LLP dataset.
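To make the task output concrete, the sketch below shows one common way to turn per-segment predictions into modality tags and temporal boundaries. It illustrates the parsing task in general, not the LINK pipeline: the helper names `modality_tags` and `segments_to_events`, the 0.5 threshold, and the 1-second segment length are assumptions made for this example.

```python
def modality_tags(audio_probs, visual_probs, threshold=0.5):
    """Tag each segment of one event class as both, audible, visible, or none.

    audio_probs, visual_probs: per-segment probabilities for the same event
    class, one value per temporal segment. The threshold is an assumed value.
    """
    tags = []
    for a, v in zip(audio_probs, visual_probs):
        heard, seen = a >= threshold, v >= threshold
        if heard and seen:
            tags.append("both")
        elif heard:
            tags.append("audible")
        elif seen:
            tags.append("visible")
        else:
            tags.append("none")
    return tags


def segments_to_events(probs, threshold=0.5, segment_seconds=1.0):
    """Merge consecutive active segments into (start, end) times in seconds."""
    events = []
    start = None
    for t, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = t  # an event begins at this segment
        elif not active and start is not None:
            events.append((start * segment_seconds, t * segment_seconds))
            start = None  # the event ended before this segment
    if start is not None:
        # The event runs through the final segment of the video.
        events.append((start * segment_seconds, len(probs) * segment_seconds))
    return events
```

For example, `segments_to_events([0.1, 0.9, 0.8, 0.2, 0.7])` yields `[(1.0, 3.0), (4.0, 5.0)]`: one event covering the second and third segments and another covering the last segment.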
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Synchronization
Weakly-Labeled Videos
Content Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

LINK method
audio-video synchronization
LLP data handling
Langyu Wang
School of Logistics Engineering, Shanghai Maritime University, China
Bingke Zhu
Institute of Automation, Chinese Academy of Sciences
Yingying Chen
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, China