To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

This work addresses the challenge of person retrieval in real-world broadcast videos, where target individuals often lack either audio or visual modalities, causing fixed multimodal fusion strategies to suffer from interference by invalid modalities and reduced retrieval accuracy. To overcome this limitation, the authors propose a query-adaptive audio-visual person retrieval framework that introduces, for the first time, an active modality detection mechanism. This mechanism automatically identifies effective modalities for each query by evaluating cross-modal score consistency and dynamically selects the optimal retrieval strategy accordingly. Evaluated on the BBC Rewind dataset, the proposed method achieves a P@1 accuracy of 94.2%, significantly outperforming both unimodal and fixed-fusion baselines, and bridges 64% of the performance gap toward the ideal oracle.

📝 Abstract

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).

Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval

person retrieval

active modality detection

audio-visual fusion

query-adaptive

Innovation

Methods, ideas, or system contributions that make the work stand out.

query-adaptive retrieval

active modality detection

audio-visual person retrieval