AI Summary
This work addresses person retrieval in real-world broadcast video, where the target is often absent from either the audio or the visual stream. In that setting, fixed multimodal fusion strategies pick up interference from the inactive modality and lose retrieval accuracy. To overcome this limitation, the authors propose a query-adaptive audio-visual person retrieval framework that introduces, for the first time, an active modality detection mechanism. For each query, the mechanism identifies which modalities are active by evaluating cross-modal score consistency and then selects the retrieval strategy accordingly. Evaluated on the BBC Rewind dataset, the proposed method achieves a P@1 of 94.2%, outperforming both unimodal and fixed-fusion baselines and closing 64% of the gap between fixed fusion and an oracle that has ground-truth modality labels.
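The 64% figure can be checked against the P@1 values reported in the abstract. The short calculation below is a sketch that assumes the gap is measured from the fixed-fusion baseline (90.0%) to the oracle (96.6%). That reading matches the reported number, whereas measuring from the face-only baseline (93.4%) would give roughly 25%. The helper name `gap_recovered` is hypothetical.

```python
def gap_recovered(system: float, baseline: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap closed by the system."""
    return (system - baseline) / (oracle - baseline)

# P@1 values (percent) as reported in the abstract.
ADAPTIVE, FUSION, FACE_ONLY, ORACLE = 94.2, 90.0, 93.4, 96.6

vs_fusion = gap_recovered(ADAPTIVE, FUSION, ORACLE)   # about 0.64
vs_face = gap_recovered(ADAPTIVE, FACE_ONLY, ORACLE)  # about 0.25

print(f"gap recovered vs fixed fusion: {vs_fusion:.1%}")
print(f"gap recovered vs face-only:    {vs_face:.1%}")
```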
Abstract
When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both seen and heard. Fusing scores from an absent modality injects noise, degrading precision below that of the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (over 12,000 broadcast videos), the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), and recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
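The cross-modal consistency idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes trained classifiers over cross-modal features, while the code below substitutes a fixed threshold on a single hand-made feature. The function names, the z-score normalisation, the top-`k` size, the `threshold` value, and the fallback rule that picks the modality with the more prominent peak are all assumptions made for illustration.

```python
import numpy as np

def zscore(scores):
    """Standardise per-file scores; returns zeros if all scores are equal."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    return np.zeros_like(scores) if std == 0 else (scores - scores.mean()) / std

def consistency(primary_z, other_z, k=5):
    """Mean standardised score in `other_z` of the top-k files under `primary_z`."""
    top_k = np.argsort(primary_z)[::-1][:k]
    return float(other_z[top_k].mean())

def adaptive_scores(voice, face, k=5, threshold=1.0):
    """Pick a retrieval mode per query and return (mode, per-file scores).

    Assumed rule (hypothetical, not from the paper): fuse only when the
    top-k files of each modality also score well on the other one.
    """
    voice_z, face_z = zscore(voice), zscore(face)
    agreement = min(consistency(voice_z, face_z, k),
                    consistency(face_z, voice_z, k))
    if agreement >= threshold:
        return "both", (voice_z + face_z) / 2
    # Agreement broke down: fall back to the modality with the sharper peak.
    if voice_z.max() >= face_z.max():
        return "voice", voice_z
    return "face", face_z

# Toy query over 100 files: files 0-4 contain the target.
target = np.zeros(100)
target[:5] = 1.0
# Deterministic stand-in for scores from an absent modality.
noise = (np.arange(100) * 37 % 100) / 100.0

print(adaptive_scores(target, target)[0])  # target seen and heard
print(adaptive_scores(target, noise)[0])   # target heard but unseen
print(adaptive_scores(noise, target)[0])   # target seen but unheard
```

In this toy setup the three queries are routed to fusion, voice-only, and face-only retrieval respectively. A learned classifier, as described in the abstract, would replace the fixed threshold and fallback rule.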