🤖 AI Summary
This work addresses fine-grained semantic alignment in text–audio cross-modal retrieval, which is hindered by the modality gap between text and audio. The authors propose FORTE, a unified framework that integrates symbolic logical reasoning with representation learning. FORTE converts queries into first-order logic and refines them so that the original meaning is preserved while discriminative attributes are added, then aligns the refined representation with audio embeddings through a lightweight, parameter-efficient cross-modal projection. A predicate-aware re-ranking step enforces logical consistency of the retrieved results at inference. Experiments on the AudioCaps and Clotho datasets show consistent improvements over strong baselines, particularly in fine-grained retrieval scenarios.
📝 Abstract
Text-to-audio retrieval has made significant progress with shared embedding models such as CLAP and Pengi, yet these models often struggle with fine-grained semantic alignment because of the inherent modality gap between text and audio. In this work, we propose FORTE, a unified framework that integrates structured logical reasoning with parameter-efficient cross-modal alignment to improve retrieval precision. Our approach first transforms queries into first-order logic and refines them via a constrained search that preserves semantic invariance while introducing discriminative attributes. The refined representation is then aligned with audio embeddings using a lightweight projection module, followed by a predicate-aware re-ranking step that enforces logical consistency at inference. Extensive experiments on AudioCaps and Clotho demonstrate consistent improvements over strong baselines, particularly in challenging fine-grained scenarios. Our results highlight the effectiveness of combining symbolic reasoning with representation learning for cross-modal retrieval.
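
To make the last two stages of the pipeline concrete, the following is a minimal sketch of how a projection step followed by predicate-aware re-ranking could be wired together. It is not FORTE's implementation: the abstract does not specify the scoring rule, so everything here is an assumption for illustration, including the `Candidate` type, the `(name, argument)` predicate encoding, a plain linear map `W` standing in for the projection module, and the `alpha` weight that blends cosine similarity with the fraction of query predicates a clip satisfies.

```python
from dataclasses import dataclass
from math import sqrt
from typing import FrozenSet, List, Tuple

# Hypothetical predicate encoding: (name, argument), e.g. ("barks", "dog").
Predicate = Tuple[str, str]


@dataclass
class Candidate:
    audio_id: str
    embedding: List[float]            # audio embedding from some pretrained encoder
    predicates: FrozenSet[Predicate]  # predicates assumed to be attached to the clip


def project(W: List[List[float]], x: List[float]) -> List[float]:
    """Apply a linear map W to x (stand-in for a lightweight projection module)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rerank(
    query_emb: List[float],
    query_preds: FrozenSet[Predicate],
    candidates: List[Candidate],
    W: List[List[float]],
    alpha: float = 0.5,  # assumed blending weight, not taken from the paper
) -> List[str]:
    """Order candidates by a blend of embedding similarity and predicate coverage."""
    q = project(W, query_emb)
    scored = []
    for c in candidates:
        sim = cosine(q, c.embedding)
        # Fraction of the query's predicates that the candidate satisfies.
        coverage = (
            len(query_preds & c.predicates) / len(query_preds) if query_preds else 0.0
        )
        scored.append(((1 - alpha) * sim + alpha * coverage, c.audio_id))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [audio_id for _, audio_id in scored]


# Toy usage: clip "a" is closer in embedding space, but clip "b" satisfies
# every query predicate, so the blended score can change the ordering.
identity = [[1.0, 0.0], [0.0, 1.0]]
query_preds = frozenset({("barks", "dog"), ("loud", "dog")})
clips = [
    Candidate("a", [1.0, 0.0], frozenset({("meows", "cat")})),
    Candidate("b", [0.8, 0.6], query_preds),
]
ranking = rerank([1.0, 0.0], query_preds, clips, identity, alpha=0.5)
```

With `alpha=0` the ranking in this sketch reduces to plain embedding similarity, which makes the contribution of the predicate term easy to isolate on a toy example.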