GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of modeling fine-grained audio-visual interactions in active speaker detection (ASD) under unconstrained settings, where conventional late fusion falls short, we propose GateFusion, an architecture that pairs strong pre-trained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate adaptively injects contextual features from one modality into the other across multiple Transformer layers, enabling deep cross-modal collaboration. We further introduce two auxiliary supervision signals, Masked Alignment Loss (MAL) and Over-Positive Penalty (OPP), which jointly suppress visual false activations and enhance modality consistency. Evaluated on Ego4D-ASD, UniTalk, and WASD, GateFusion achieves state-of-the-art mAP scores of 77.8%, 86.1%, and 96.1%, respectively, and demonstrates strong cross-domain generalization.

📝 Abstract
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
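The core mechanism described above, injecting contextual features from one modality into the other through learnable, bimodally-conditioned gates at multiple Transformer depths, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, parameter names, and the residual form of the injection are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = 8  # feature dimension (hypothetical)

# Hypothetical per-layer parameters: a gate conditioned on BOTH modalities
# (hence "bimodally-conditioned") and a projection of the audio context.
W_gate = rng.normal(scale=0.1, size=(D, 2 * D))
W_proj = rng.normal(scale=0.1, size=(D, D))

def gated_inject(visual, audio):
    """Inject audio context into the visual stream through a learnable
    gate in (0, 1)^D computed from the concatenated modalities."""
    gate = sigmoid(W_gate @ np.concatenate([visual, audio]))
    return visual + gate * (W_proj @ audio)  # gated residual injection

# Hierarchical use: repeat the injection across several depths,
# standing in for multiple layers of a Transformer backbone.
v = rng.normal(size=D)  # visual token features
a = rng.normal(size=D)  # audio context features
for _ in range(3):
    v = gated_inject(v, a)

print(v.shape)  # (8,)
```

In a real model the gate and projection would be separate learned parameters per layer, and the same scheme would run symmetrically in the audio-to-visual direction; here one shared pair keeps the sketch short.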
Problem

Research questions and friction points this paper is trying to address.

Late fusion of audio and visual features misses fine-grained cross-modal interactions
ASD remains unreliable in unconstrained, real-world scenarios
Video-only false activations and weak cross-modal alignment hurt multimodal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Gated Fusion Decoder (HiGate) for deep cross-modal interaction
Auxiliary objectives (MAL and OPP) that strengthen multimodal learning
Progressive, multi-depth feature injection guided by learnable, bimodally-conditioned gates
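The two auxiliary objectives named above can be sketched in simplified form. The exact formulations, masking strategy, and distance measures in the paper are not reproduced here; the functions below are illustrative assumptions that only capture the stated intent: MAL aligns unimodal outputs with the multimodal prediction, and OPP suppresses spurious video-only activations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mal(uni_logits, multi_logits):
    """Masked Alignment Loss sketch: pull unimodal predictions toward
    the multimodal predictions (treated as a fixed target here)."""
    p_uni = sigmoid(uni_logits)
    p_multi = sigmoid(multi_logits)
    return float(np.mean((p_uni - p_multi) ** 2))

def opp(video_logits, labels):
    """Over-Positive Penalty sketch: penalize confident video-only
    positives on frames whose label is not-speaking (label == 0)."""
    p = sigmoid(video_logits)
    return float(np.mean(p * (1 - labels)))

v_logits = np.array([2.0, -1.0, 3.0])   # video-only logits (toy values)
m_logits = np.array([1.5, -2.0, -0.5])  # multimodal logits (toy values)
labels   = np.array([1.0,  0.0,  0.0])  # ground-truth speaking labels

loss = mal(v_logits, m_logits) + opp(v_logits, labels)
print(loss >= 0.0)  # True
```

In practice both terms would be weighted and added to the main detection loss, with the multimodal branch's gradient detached in MAL so that the unimodal branches chase it rather than the reverse; those choices are also assumptions.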
👥 Authors
Yu Wang (Indiana University, Bloomington, Indiana, USA)
Juhyung Ha (Ph.D. student, Indiana University; computer vision)
Frangil M. Ramirez (Ph.D. student, Indiana University Bloomington; computer vision, video understanding, cognition)
Yuchen Wang (Indiana University, Bloomington, Indiana, USA)
David J. Crandall (Indiana University, Bloomington, Indiana, USA)