Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

📅 2025-03-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Addressing four key challenges in audio-visual event perception (poor open-vocabulary generalization, the high cost of cross-modal temporal annotation, weak adaptation to shifting event distributions, and lossy late-stage fusion of modalities), this paper proposes the first training-free open-vocabulary framework for the task. The method contributes: (1) the first training-free open-vocabulary baseline for audio-visual event perception; (2) a within-video label shift algorithm that uses predictions on earlier frames to adapt event distributions and decision thresholds for later frames; and (3) score-level multimodal fusion that preserves fine-grained audio-visual interactions. The approach is model-agnostic and requires no additional training. Applied on top of both zero-shot and weakly-supervised state-of-the-art methods, the framework yields substantial gains in event localization and classification, demonstrating the effectiveness and robustness of training-free paradigms in open, dynamically evolving video scenarios.
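
To make the score-level fusion concrete, here is a minimal sketch of the idea rather than the paper's exact formulation. It assumes per-segment cosine-similarity matrices `audio_scores` and `visual_scores` of shape `[T, C]` from pretrained audio-text and image-text encoders; the function name, the mixing weight `alpha`, and the `temperature` are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def score_level_fusion(audio_scores, visual_scores, alpha=0.5, temperature=0.07):
    """Fuse per-segment, per-class audio and visual similarity scores.

    audio_scores, visual_scores: [T, C] cosine similarities between
    T video segments and C class prompts. Fusing at the score level,
    before any class decision, keeps per-class cross-modal agreement
    that decision-level late fusion would discard.
    """
    a = softmax(audio_scores / temperature)
    v = softmax(visual_scores / temperature)
    return alpha * a + (1.0 - alpha) * v  # [T, C] fused class probabilities
```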

📝 Abstract
In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis ($\text{AV}^2\text{A}$), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. $\text{AV}^2\text{A}$ also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that $\text{AV}^2\text{A}$ achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of $\text{AV}^2\text{A}$ on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.
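
The within-video label shift idea can be sketched as follows. This is an illustrative reading, not the paper's exact algorithm: it assumes fused per-segment class probabilities `fused_scores` of shape `[T, C]` (e.g., from the fusion sketch above), and the exponential-moving-average prior update with `momentum` is an assumed instantiation of "leveraging predictions from prior frames".

```python
import numpy as np

def within_video_label_shift(fused_scores, momentum=0.9):
    """Re-weight each segment's class scores by a prior estimated
    from predictions on earlier segments of the same video.

    fused_scores: [T, C] fused audio-visual class probabilities.
    The EMA prior update below is an assumed instantiation, not
    necessarily the paper's exact rule.
    """
    T, C = fused_scores.shape
    prior = np.full(C, 1.0 / C)              # start from a uniform prior
    adjusted = np.empty_like(fused_scores)
    for t in range(T):
        p = fused_scores[t] * prior          # bias toward classes seen so far
        p = p / p.sum()
        adjusted[t] = p
        prior = momentum * prior + (1.0 - momentum) * p  # update running prior
    return adjusted
```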
Problem

Research questions and friction points this paper is trying to address.

Generalizing to unseen event categories in audio-visual perception.
Reducing labor-intensive annotation for multimodal event localization.
Adapting to dynamic event distributions in video analysis.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free, model-agnostic framework for audio-visual event perception.
Dynamic thresholds adapt to changing video dynamics (see the sketch after this list).
Score-level fusion retains rich multimodal interactions.
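
A minimal sketch of how such an adaptive threshold could work, building on the prior-adjusted scores from the sketch above. The sliding-window mean-plus-deviation rule, the `window` size, and `k` are illustrative assumptions, not the paper's exact thresholding rule.

```python
import numpy as np

def dynamic_threshold_events(adjusted_scores, k=0.5, window=8):
    """Flag a segment as containing an event when its top class score
    beats a threshold computed from the recent confidence history.

    adjusted_scores: [T, C] prior-adjusted class probabilities.
    Returns (event_mask [T] of bools, labels [T] of class indices).
    """
    top = adjusted_scores.max(axis=1)         # confidence of the best class
    labels = adjusted_scores.argmax(axis=1)
    T = len(top)
    event_mask = np.zeros(T, dtype=bool)
    for t in range(T):
        hist = top[max(0, t - window):t + 1]  # sliding confidence window
        thr = hist.mean() + k * hist.std()    # threshold tracks the video
        event_mask[t] = top[t] >= thr
    return event_mask, labels
```

In this toy pipeline the three pieces compose in order: raw similarities are fused at the score level, re-weighted by the running within-video prior, then thresholded adaptively per segment.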