Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition

📅 2025-07-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio-visual pretraining methods focus solely on holistic modality alignment and neglect fine-grained discrimination among easily confusable activity categories, which limits human activity recognition performance. To address this, the paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), which explicitly models inter-class confusion relationships and introduces a dynamically adjusted confusion-aware loss. DICCAE integrates audio-video contrastive learning, clustering-driven pseudo-label generation, fine-grained class-level cross-modal alignment, and a dynamic loss weighting mechanism within a cluster-guided self-supervised pretraining framework. This design jointly improves cross-modal semantic alignment and category discriminability. On the VGGSound dataset, DICCAE achieves 65.5% top-1 accuracy, which the authors describe as near state-of-the-art. Ablation studies are reported to validate the necessity of each component.
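As a point of reference for the audio-video contrastive learning component mentioned above, here is a minimal sketch of a symmetric InfoNCE loss over paired audio and video embeddings. It is a generic formulation, not the paper's confirmed objective; the function names and the temperature value of 0.07 are assumptions chosen for illustration, and the sketch uses only the Python standard library.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(scores, target):
    """Cross-entropy of one row of scores against the target index."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


def symmetric_info_nce(audio, video, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's exact loss).

    audio[i] and video[i] are embeddings of the same clip; every other
    pairing in the batch is treated as a negative.
    """
    a = [normalize(x) for x in audio]
    v = [normalize(x) for x in video]
    n = len(a)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]
    # Audio-to-video direction: each row should peak on its diagonal entry.
    a2v = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Video-to-audio direction: each column should peak on its diagonal entry.
    v2a = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2v + v2a)
```

When the pairs are well aligned the loss approaches zero; when audio and video embeddings are mismatched it grows with the inverse of the temperature.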

📝 Abstract

Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms focus only on aligning the overall audio and video modalities, without reinforcing the distinction between easily confused classes through cognitive induction and contrast during training. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained category level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on the degree of inter-class confusion, thereby enhancing the model's ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates the audio and video modalities as well as their fusion. To mitigate the scarcity of audio-video data in the human activity recognition task, we propose a cluster-guided audio-video self-supervised pre-training strategy for DICCAE. DICCAE achieves near state-of-the-art performance on the VGGSound dataset, with a top-1 accuracy of 65.5%. We further evaluate its feature representation quality through extensive ablation studies, validating the necessity of each module.
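The abstract does not give the formula for the dynamically adjusted confusion loss, so the following is only a sketch of one plausible reading: a running confusion matrix is maintained during training, and each sample's cross-entropy is scaled up in proportion to how often its true class is currently mistaken for other classes. The weighting rule `1 + alpha * rate`, the parameter `alpha`, and all function names are assumptions made for illustration, not the authors' method.

```python
import math


def confusion_rates(confusion):
    """Per-class fraction of samples that were predicted as another class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(0.0 if total == 0 else (total - row[i]) / total)
    return rates


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def confusion_weighted_loss(batch_logits, labels, confusion, alpha=1.0):
    """Cross-entropy where frequently confused classes get larger weights.

    The weight 1 + alpha * rate is a hypothetical choice for this sketch.
    """
    rates = confusion_rates(confusion)
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        weight = 1.0 + alpha * rates[y]
        total += -weight * log_softmax(logits)[y]
    return total / len(labels)


def update_confusion(confusion, batch_logits, labels):
    """Add the current batch's predictions to the running confusion matrix."""
    for logits, y in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        confusion[y][predicted] += 1
```

Calling `update_confusion` after each batch makes the weights track the model's current mistakes, which is the sense in which the weighting is dynamic here. With pseudo-labels instead of ground-truth labels, the same code applies unchanged.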

Problem

Research questions and friction points this paper is trying to address.

Enhancing audio-video fusion for distinguishing similar human activities

Addressing category confusion via dynamic inter-class confusion loss

Mitigating data scarcity with self-supervised pre-training strategy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Inter-Class Confusion-Aware Encoder (DICCAE)

Cluster-guided audio-video self-supervised pre-training

Fine-grained category-level audio-video alignment
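To make the cluster-guided pre-training idea concrete, the sketch below assigns pseudo-labels to unlabeled clips by running plain k-means on their embeddings. The page does not say which clustering algorithm the paper uses, so k-means, the deterministic initialization from the first `k` points, and the function names are all assumptions for illustration. The input points could be, for example, fused audio-video embeddings.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def kmeans_pseudo_labels(points, k, iterations=10):
    """Plain k-means (Lloyd's algorithm) used as a pseudo-label generator.

    Initialization from the first k points is a simplification for this
    sketch; a real pipeline would likely use a more robust seeding scheme.
    """
    centroids = [list(p) for p in points[:k]]
    labels = assign(points, centroids)
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        labels = assign(points, centroids)
    return labels, centroids
```

The returned labels can stand in for class labels during self-supervised pre-training, for instance as targets for a class-level alignment or confusion-aware loss.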

🔎 Similar Papers

Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization