Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of effectively integrating the textual modality into audio-visual representation learning. We propose the Language-Guided Contrastive Audio-Visual Masked Autoencoder (LG-CAV-MAE). Its core innovations are: (1) unsupervised construction of high-quality audio-visual-text triplets, in which frame-level captions are generated by an image captioning model and a CLAP-based filter retains only samples with strong audio-caption alignment; (2) integration of a pretrained text encoder into the contrastive audio-visual masked autoencoding framework, enabling joint representation alignment across audio, visual, and text modalities. Evaluated on audio-visual retrieval and classification benchmarks, LG-CAV-MAE achieves state-of-the-art performance: up to a 5.6% improvement in recall@10 for retrieval and a 3.2% accuracy gain for classification over prior methods. These results show that language guidance substantially enhances multimodal representation learning.
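As a concrete illustration of the triplet-generation stage, the sketch below pairs a frame-level image captioner with CLAP-based audio-caption filtering. The checkpoints (BLIP, LAION-CLAP via Hugging Face `transformers`), the similarity threshold, and the helper names are illustrative assumptions; the paper does not prescribe them here.

```python
# Hypothetical sketch of automatic audio-visual-text triplet generation.
# Model choices (BLIP captioner, LAION-CLAP scorer) and the threshold are
# assumptions, not specified by this summary.
import torch
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    ClapProcessor, ClapModel,
)

cap_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")

@torch.no_grad()
def caption_frame(frame):
    """Generate a caption for one video frame (a PIL image)."""
    inputs = cap_proc(images=frame, return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=30)
    return cap_proc.decode(ids[0], skip_special_tokens=True)

@torch.no_grad()
def audio_caption_similarity(waveform, caption, sr=48000):
    """Cosine similarity between CLAP audio and text embeddings."""
    inputs = clap_proc(text=[caption], audios=[waveform],
                       sampling_rate=sr, return_tensors="pt", padding=True)
    out = clap(**inputs)
    a = torch.nn.functional.normalize(out.audio_embeds, dim=-1)
    t = torch.nn.functional.normalize(out.text_embeds, dim=-1)
    return (a * t).sum(-1).item()

def build_triplets(clips, threshold=0.3):  # threshold value is illustrative
    """Keep only (audio, frame, caption) triplets with strong alignment."""
    triplets = []
    for frame, waveform in clips:
        caption = caption_frame(frame)
        if audio_caption_similarity(waveform, caption) >= threshold:
            triplets.append((waveform, frame, caption))
    return triplets
```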

📝 Abstract
In this paper, we propose Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) to improve audio-visual representation learning. LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoencoders, enabling the model to learn across audio, visual and text modalities. To train LG-CAV-MAE, we introduce an automatic method to generate audio-visual-text triplets from unlabeled videos. We first generate frame-level captions using an image captioning model and then apply CLAP-based filtering to ensure strong alignment between audio and captions. This approach yields high-quality audio-visual-text triplets without requiring manual annotations. We evaluate LG-CAV-MAE on audio-visual retrieval tasks, as well as an audio-visual classification task. Our method significantly outperforms existing approaches, achieving up to a 5.6% improvement in recall@10 for retrieval tasks and a 3.2% improvement for the classification task.
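The cross-modal contrastive training described above can be pictured as pairwise InfoNCE losses over the three modalities, with text embeddings produced by a frozen pretrained encoder. The sketch below is a minimal version under those assumptions, not the paper's exact objective; the symmetric pairing and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE over batches of paired embeddings x, y: (B, D)."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                    # (B, B) similarities
    targets = torch.arange(x.size(0), device=x.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def trimodal_contrastive_loss(audio_emb, visual_emb, text_emb):
    """Sum of pairwise contrastive losses across the three modalities.

    text_emb is assumed to come from a frozen pretrained text encoder,
    acting as a language anchor for the audio and visual branches.
    """
    return (info_nce(audio_emb, visual_emb)
            + info_nce(audio_emb, text_emb)
            + info_nce(visual_emb, text_emb))
```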
Problem

Research questions and friction points this paper is trying to address.

Improving audio-visual representation learning with text integration
Automatically generating audio-visual-text triplets from videos
Enhancing performance in audio-visual retrieval and classification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-guided contrastive audio-visual masked autoencoder (see the masking sketch after this list)
Automatic generation of audio-visual-text triplets
CLAP-based filtering for strong audio-caption alignment
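For the masked-autoencoding component named in the first bullet, a minimal MAE-style random masking routine might look like the sketch below; the 75% mask ratio and the function signature are assumptions rather than details taken from the paper.

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """Uniform random masking over patch tokens, MAE-style.

    tokens: (B, N, D) audio or visual patch embeddings. Returns the visible
    tokens and the permutation needed to restore order for the decoder.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)  # one random score per token
    ids_shuffle = noise.argsort(dim=1)              # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)        # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]              # indices of kept tokens
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore
```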