Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods jointly optimize contrastive alignment and masked reconstruction objectives, which often introduces semantic noise and optimization interference, limiting cross-modal representation learning. This work proposes TG-DP, a framework that, for the first time, decouples reconstruction and alignment into separate optimization paths. Each path employs a visibility pattern tailored to its objective, and a teacher model guides the organization of visible tokens in the contrastive path, reducing interference and improving representation quality. The method delivers significant zero-shot retrieval gains on AudioSet, raising R@1 from 35.2% to 37.4% (video→audio) and from 27.9% to 37.1% (audio→video), and attains state-of-the-art linear-probe performance on the AS20K and VGGSound benchmarks.
📝 Abstract
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
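The core idea in the abstract, giving the reconstruction and contrastive branches independent visibility patterns, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the cosine-similarity scoring against a teacher's global embedding is an assumed stand-in for the paper's teacher-guided token organization.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_visible(num_tokens, keep, rng):
    """Reconstruction path: random visibility, as in masked autoencoders."""
    idx = rng.permutation(num_tokens)[:keep]
    return np.sort(idx)

def teacher_guided_visible(tokens, teacher_embed, keep):
    """Contrastive path: keep the tokens a teacher scores as most salient.
    Here salience is cosine similarity to the teacher's global embedding,
    a hypothetical proxy for the paper's guidance signal."""
    t = teacher_embed / np.linalg.norm(teacher_embed)
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    scores = x @ t
    idx = np.argsort(scores)[::-1][:keep]
    return np.sort(idx)

# Toy audio token sequence: 16 tokens with 8-dim features.
tokens = rng.standard_normal((16, 8))
teacher_embed = rng.standard_normal(8)

# Each branch receives its own subset of visible tokens, so the
# contrastive objective is no longer tied to the reconstruction mask.
rec_vis = random_visible(16, keep=4, rng=rng)
con_vis = teacher_guided_visible(tokens, teacher_embed, keep=4)
print(rec_vis, con_vis)
```

In a joint single-path setup both losses would share `rec_vis`; the decoupling above is what lets the contrastive branch operate on semantically organized tokens instead of a random reconstruction mask.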
Problem

Research questions and friction points this paper is trying to address.

semantic noise
audio-visual representation learning
contrastive alignment
masked reconstruction
optimization interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-path learning
teacher-guided representation
audio-visual pretraining
semantic noise reduction
contrastive alignment
Linge Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Yingying Chen
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Bingke Zhu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Lu Zhou
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Objecteye Inc., Beijing, China