Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world multimodal person identification often suffers from missing or degraded modalities (e.g., single- or dual-modality failure). To address this, we propose a robust tri-modal identification framework that leverages speech, face, and gesture. Our method introduces two novel components: (1) a confidence-weighted dynamic fusion mechanism that adaptively adjusts each modality's contribution under partial modality failure, and (2) a cross-modal gated attention architecture that models inter-modal dependencies and suppresses unreliable signals. We further employ multi-task learning and cross-attention to explicitly capture complementary information across modalities. On the interview-scenario dataset CANDOR, for which we establish the first benchmark, the framework achieves 99.18% Top-1 accuracy with full tri-modal input, reaches 99.92% on VoxCeleb1 under dual-modal conditions, and significantly outperforms conventional unimodal and late-fusion baselines even with single-modality input. These results demonstrate strong robustness and generalization under heterogeneous modality availability.
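
The summary gives no equations for the confidence-weighted dynamic fusion, so the following is a minimal PyTorch sketch of one plausible reading: each modality branch emits an embedding plus a learned scalar confidence, missing modalities are masked out, and the softmax renormalizes weights over whatever remains. The class name, dimensions, and masking convention are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ConfidenceWeightedFusion(nn.Module):
    """Fuse per-modality embeddings with learned, mask-aware softmax weights."""

    def __init__(self, dim: int = 256, num_modalities: int = 3):
        super().__init__()
        # One confidence head per modality: embedding -> scalar logit.
        self.conf_heads = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(num_modalities)]
        )

    def forward(self, embeddings: list, mask: torch.Tensor) -> torch.Tensor:
        # embeddings: list of (B, dim) tensors, one per modality.
        # mask: (B, num_modalities), 1.0 = modality present, 0.0 = missing.
        logits = torch.cat(
            [head(e) for head, e in zip(self.conf_heads, embeddings)], dim=-1
        )  # (B, num_modalities)
        # -inf on missing modalities gives them exactly zero weight; the
        # softmax then renormalizes over the available modalities. Assumes
        # at least one modality is present per sample.
        logits = logits.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)          # (B, num_modalities)
        stacked = torch.stack(embeddings, dim=1)         # (B, num_modalities, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (B, dim)
```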

📝 Abstract
Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a trimodal person identification framework that integrates voice, face, and gesture while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms that facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring reliable classification even in unimodal or bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset that we benchmark for the first time. Our results demonstrate that the proposed trimodal system achieves 99.18% Top-1 accuracy on person identification, outperforming conventional unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 benchmark and reach 99.92% accuracy in the bimodal setting. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
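
The abstract pairs cross-attention with a gate, which suggests a mechanism that decides how much attended cross-modal signal to admit into a modality's representation. Below is a hedged sketch under that assumption; the `GatedCrossAttention` name, the sigmoid gate over concatenated features, and the residual form are guesses at the architecture, not the authors' code.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """One modality attends to another; a gate suppresses unreliable signal."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate decides, per feature, how much cross-modal signal to admit.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (B, Tq, dim) tokens of the target modality.
        # context: (B, Tk, dim) tokens of the source modality.
        attended, _ = self.attn(query, context, context)
        g = self.gate(torch.cat([query, attended], dim=-1))
        # A gate near 0 suppresses an unreliable cross-modal signal and
        # falls back to the unimodal representation via the residual path.
        return self.norm(query + g * attended)
```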
Problem

Research questions and friction points this paper is trying to address.

Robust person identification with missing modalities
Integrates voice, face, and gesture using adaptive fusion
Maintains high accuracy in unimodal or bimodal scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trimodal framework integrates voice, face, and gesture modalities
Cross-attention and gated fusion enable interaction across modalities
Confidence-weighted fusion adapts to missing or low-quality data (see the sketch after this list)
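
As referenced above, here is a hypothetical end-to-end use of the fusion sketch: combining voice and face embeddings when the gesture stream is unavailable. It assumes the `ConfidenceWeightedFusion` class from the earlier sketch is in scope; the zero placeholder and the mask layout are illustrative conventions, not the paper's.

```python
import torch

# Fuse three 256-d modality embeddings for a batch of 8 samples.
fusion = ConfidenceWeightedFusion(dim=256, num_modalities=3)

B, dim = 8, 256
voice = torch.randn(B, dim)
face = torch.randn(B, dim)
gesture = torch.zeros(B, dim)  # placeholder for the missing stream
# Mask order: (voice, face, gesture); gesture marked absent.
mask = torch.tensor([[1.0, 1.0, 0.0]]).expand(B, -1)

fused = fusion([voice, face, gesture], mask)  # (B, dim); gesture weight is 0
print(fused.shape)
```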