🤖 AI Summary
This work addresses the significant distributional discrepancy and structural misalignment between audio-visual and textual modalities in generalized zero-shot learning (GZSL). To mitigate these challenges, the authors propose a novel approach that integrates Z-score normalization with a three-level hierarchical alignment mechanism. By normalizing fused audio-visual and textual embeddings and jointly optimizing alignment at the semantic, class, and batch levels within a shared embedding space, the method effectively reduces inter-modal distributional shifts while preserving semantic relationships and intra-batch spatial consistency. Evaluated on three standard benchmarks—VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL—the proposed framework achieves competitive performance. Notably, it is the first to incorporate both normalization and multi-level alignment into audio-visual GZSL, substantially enhancing the robustness and generalization of learned representations.
📝 Abstract
Audio-visual Generalized Zero-shot Learning (AV-GZSL) is a challenging task that aims to classify both seen and unseen objects or scenes by integrating data from audio and visual modalities. Recent studies primarily focus on fusing or aligning audio and visual features to generate more informative audio-visual embeddings. Also, aligning the audio-visual and textual features of most existing methods relies solely on the optimization objectives. However, those methods neglect the inherent distributional and structural differences between audio-visual and textual modalities. To address this limitation, we propose a method termed Aligning Hierarchical Standardized Embedding (AHSE), which enables hierarchical alignment of standardized audio-visual and textual embeddings within a shared embedding space. Specifically, we first apply Z-score standardization to the fused audio-visual and textual embeddings to reduce distributional mismatches. We then introduce a hierarchical alignment strategy that minimizes discrepancies at the semantic, class, and batch levels, thereby constructing a more robust and well-structured embedding space. This strategy not only preserves semantic and inter-class relationships but also maintains spatial consistency within each batch. Extensive experiments on three benchmark datasets: VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, demonstrate that AHSE achieves competitive performance in zero-shot learning.