FORTE: FOL-guided Optimal Refinement for Text-audio rEtrieval

📅 2026-06-04

🤖 AI Summary

This work addresses fine-grained semantic alignment in text-to-audio retrieval, which shared embedding models handle poorly because of the modality gap between text and audio. The authors propose FORTE, a unified framework that combines symbolic logical reasoning with representation learning. FORTE converts each query into first-order logic and refines it with a constrained search that preserves semantic invariance while adding discriminative attributes, then aligns the refined representation with audio embeddings through a lightweight, parameter-efficient projection module. At inference, a predicate-aware re-ranking step enforces logical consistency among the retrieved results. Experiments on the AudioCaps and Clotho datasets show consistent improvements over strong baselines, particularly in fine-grained retrieval scenarios.

📝 Abstract

Text-to-audio retrieval has made significant progress with shared embedding models such as CLAP and Pengi, yet these models often struggle with fine-grained semantic alignment due to the inherent modality gap between text and audio. In this work, we propose FORTE, a unified framework that integrates structured logical reasoning with parameter-efficient cross-modal alignment to improve retrieval precision. Our approach first transforms queries into first-order logic and refines them via a constrained search that preserves semantic invariance while introducing discriminative attributes. The refined representation is then aligned with audio embeddings using a lightweight projection module, followed by a predicate-aware re-ranking step that enforces logical consistency at inference. Extensive experiments on AudioCaps and Clotho demonstrate consistent improvements over strong baselines, particularly in challenging fine-grained scenarios. Our results highlight the effectiveness of combining symbolic reasoning with representation learning for cross-modal retrieval.
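
The abstract does not say how the predicate-aware re-ranking step scores candidates, so the following is only a minimal sketch of one way such a step could work, not the authors' method. It assumes each retrieved clip already carries an embedding similarity score and a set of detected predicate names, and that the query has been reduced to a set of predicate names. The `Candidate` class, the `predicate_consistency` and `rerank` functions, the `weight` parameter, and the scoring rule (a convex combination of embedding score and predicate coverage) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A retrieved audio clip (hypothetical structure, for illustration only)."""
    clip_id: str
    embedding_score: float      # similarity from the embedding model, assumed given
    predicates: frozenset[str]  # predicate names detected for the clip, assumed given


def predicate_consistency(query_predicates: frozenset[str],
                          clip_predicates: frozenset[str]) -> float:
    """Return the fraction of query predicates the clip satisfies.

    An empty query imposes no constraints, so it is treated as fully satisfied.
    """
    if not query_predicates:
        return 1.0
    return len(query_predicates & clip_predicates) / len(query_predicates)


def rerank(query_predicates: frozenset[str],
           candidates: list[Candidate],
           weight: float = 0.5) -> list[Candidate]:
    """Sort candidates by a convex combination of the two scores.

    weight = 0.0 keeps the original embedding ranking; weight = 1.0 ranks
    purely by predicate coverage. This scoring rule is an assumption.
    """
    def combined(candidate: Candidate) -> float:
        consistency = predicate_consistency(query_predicates, candidate.predicates)
        return (1.0 - weight) * candidate.embedding_score + weight * consistency

    return sorted(candidates, key=combined, reverse=True)


# Toy query "a dog barks while rain falls", reduced to predicate names.
query_predicates = frozenset({"dog", "barks", "rain"})
candidates = [
    Candidate("clip_a", 0.82, frozenset({"dog", "barks"})),
    Candidate("clip_b", 0.78, frozenset({"dog", "barks", "rain"})),
    Candidate("clip_c", 0.80, frozenset({"rain"})),
]
ranked_ids = [c.clip_id for c in rerank(query_predicates, candidates)]
print(ranked_ids)  # ['clip_b', 'clip_a', 'clip_c']
```

In this toy case, `clip_b` has the lowest embedding score but is the only clip that satisfies all three query predicates, so it moves to the top once predicate coverage is taken into account. That is the kind of fine-grained correction a logical-consistency re-ranking step is meant to provide.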

Problem

Research questions and friction points this paper is trying to address.

text-to-audio retrieval

modality gap

fine-grained semantic alignment

cross-modal retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

first-order logic

cross-modal retrieval

semantic refinement

parameter-efficient alignment