AI Summary
Existing audio-text retrieval (ATR) methods suffer from the gradient locality bottleneck (GLB) induced by small-batch contrastive learning, which limits their capacity to model fine-grained and long-tail semantics beyond the batch. While external knowledge enhancement alleviates the GLB, it introduces representation-drift mismatch (RDM): a misalignment between static knowledge bases and the model's dynamically evolving representations. To address both challenges jointly, this paper proposes the Adaptive Self-improving Knowledge (ASK) framework, the first to systematically decouple and co-resolve GLB and RDM. ASK integrates three core innovations: (1) multi-granularity knowledge injection, (2) dynamic graph neural network-driven knowledge refinement, and (3) an adaptive reliability-weighting mechanism under cross-modal embedding alignment. Evaluated on the AudioCaps and Clotho benchmarks, ASK achieves state-of-the-art performance, with significant gains in retrieval accuracy for fine-grained and long-tail samples.
Abstract
The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability-weighting scheme that ensures only consistent knowledge contributes to optimization. Experiments on two benchmark datasets show that ASK achieves state-of-the-art performance, demonstrating its efficacy.
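The mini-batch contrastive objective that gives rise to the GLB can be sketched as a standard symmetric InfoNCE loss over paired audio/text embeddings. This is a generic illustration of the paradigm the abstract describes, not the paper's exact implementation; the function name, temperature value, and use of NumPy are our own assumptions. Note that the similarity matrix is computed only over the current batch, so gradients never involve out-of-batch items.

```python
import numpy as np

def info_nce_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over one mini-batch of paired
    audio/text embeddings (rows assumed L2-normalized).

    The similarity matrix is restricted to the current batch:
    out-of-batch items never appear in the loss, which is the
    gradient-locality limitation formalized as the GLB.
    """
    logits = audio_emb @ text_emb.T / temperature  # (B, B) batch-local similarities
    labels = np.arange(len(logits))                # matched pairs lie on the diagonal

    def xent(l):
        # numerically stable log-softmax along each row,
        # then negative log-likelihood of the diagonal (positive) pairs
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average the audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while misaligned embeddings are penalized; knowledge-enhanced methods such as ASK augment this purely batch-local signal with external information.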