🤖 AI Summary
Existing audio–text retrieval methods rely solely on coarse-grained clip–sentence contrastive learning, neglecting fine-grained cross-modal alignments such as frame–word and segment–phrase correspondences. To address this, we propose a three-level hierarchical interaction framework that explicitly models clip–sentence, segment–phrase, and frame–word alignments. We further introduce an auxiliary-caption-driven dual-branch matching mechanism that leverages pre-trained vision–language models (e.g., BLIP) to generate high-quality pseudo-captions, enabling both representation enhancement and data augmentation. This work establishes the first multi-granularity cross-modal interaction paradigm for audio–text retrieval, integrating hierarchical contrastive learning, cross-modal attention, and joint optimization. Experiments on AudioCaps and Clotho show significant Recall@1 improvements: the HCI module yields an average +4.2% gain, and the full AC framework consistently adds 2.1%–3.5%.
📝 Abstract
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, ignoring fine-grained cross-modal relationships, e.g., between short segments and phrases or between frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR that simultaneously explores clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. In addition, we present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between the audio and the generated captions, which yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs as data augmentation for training. Experiments show that our HCI significantly improves ATR performance, and our AC framework yields stable gains across multiple datasets.
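As a rough sketch of the multi-granularity objective described above — assuming a symmetric InfoNCE loss at each level, which the abstract does not fully specify — the three alignment levels (clip-sentence, segment-phrase, frame-word) can each be scored with a contrastive loss over pooled embeddings and then summed. The function and weight names here are illustrative, not the paper's actual implementation:

```python
import numpy as np

def info_nce(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over paired embeddings of shape (N, D)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                 # (N, N) cosine similarities / temperature
    idx = np.arange(len(a))                # matched pairs lie on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as the positive class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average of audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

def hierarchical_loss(level_pairs, weights=(1.0, 1.0, 1.0)):
    """level_pairs: [(clip, sentence), (segment, phrase), (frame, word)]
    pooled embeddings per level; returns the weighted sum of per-level losses."""
    return sum(w * info_nce(a, t) for w, (a, t) in zip(weights, level_pairs))
```

In practice the segment- and frame-level embeddings would be produced by pooling cross-modal attention outputs before entering the loss; this sketch only shows how the three granularities combine into one training objective.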