🤖 AI Summary
To address insufficient semantic alignment in cross-modal video–text retrieval, this paper proposes Modality Auxiliary Concepts for Video Retrieval (MAC-VR): foundation models are used to automatically extract modality-specific tags, and the underlying visual and textual concept representations are explicitly learned and aligned within a shared latent space. The method combines tag extraction, joint cross-modal alignment, auxiliary concept learning, and contrastive optimization, improving both cross-modal discriminability and interpretability. Comprehensive evaluation on five diverse benchmarks (MSR-VTT, DiDeMo, TGIF, Charades, and YouCook2) shows consistent gains: state-of-the-art performance on three datasets, and comparable or better results on the remaining two. The paper presents this as the first work to incorporate modality-specific auxiliary concepts into a cross-modal alignment framework, yielding more interpretable and discriminative representations for video retrieval.
📝 Abstract
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags, automatically extracted from foundation models, to enhance video retrieval. We propose to align modalities in a latent space while learning and aligning auxiliary latent concepts derived from the features of a video and its corresponding caption. These auxiliary concepts improve the alignment of visual and textual latent concepts, allowing individual concepts to be distinguished from one another. We conduct extensive experiments on five diverse datasets: MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across the other two.
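To make the alignment idea concrete, below is a minimal PyTorch sketch of the general scheme the abstract describes: video and text features are projected into a shared latent space and aligned with a contrastive objective, while auxiliary per-modality concept heads are aligned with an additional loss term. This is not the authors' implementation; the encoder dimensions, the linear concept heads, the symmetric InfoNCE loss, and the `concept_weight` term are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MACVRSketch(nn.Module):
    """Illustrative sketch: cross-modal alignment plus auxiliary latent concepts.

    Hypothetical dimensions and heads; not the paper's architecture.
    """

    def __init__(self, video_dim=768, text_dim=512, latent_dim=256, num_concepts=32):
        super().__init__()
        # Project each modality into a shared latent space.
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        # Auxiliary concept heads: map each modality's features to latent concepts.
        self.video_concepts = nn.Linear(video_dim, num_concepts)
        self.text_concepts = nn.Linear(text_dim, num_concepts)
        self.temperature = 0.05

    def info_nce(self, a, b):
        """Symmetric InfoNCE over a batch of paired embeddings."""
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / self.temperature
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, video_feats, text_feats, concept_weight=0.5):
        # Cross-modal alignment in the shared latent space.
        v = self.video_proj(video_feats)
        t = self.text_proj(text_feats)
        retrieval_loss = self.info_nce(v, t)

        # Auxiliary concept alignment: matching video-caption pairs should
        # agree on their latent concept representations.
        cv = self.video_concepts(video_feats)
        ct = self.text_concepts(text_feats)
        concept_loss = self.info_nce(cv, ct)

        return retrieval_loss + concept_weight * concept_loss


if __name__ == "__main__":
    model = MACVRSketch()
    video = torch.randn(8, 768)   # e.g. pooled frame features from a video backbone
    text = torch.randn(8, 512)    # e.g. pooled caption features from a text encoder
    loss = model(video, text)
    print(loss.item())
```

At retrieval time, only the shared-space embeddings would be needed to rank videos against a query caption; the auxiliary concept heads act as a training-time regularizer that encourages the two modalities to agree on, and distinguish between, their latent concepts.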