Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in multilingual speaker identification: limited cross-lingual generalization and performance degradation when the face modality is missing. The proposed MRAF framework replaces zero-padding with a learnable missing token to reduce the distribution shift caused by modality absence, and uses a reliability-aware cross-attention mechanism to adaptively fuse audio and visual modalities within a unified token space. Trained with multi-branch classification losses, audio-only knowledge distillation, and center loss, the model achieves 100% accuracy on the P3 and P5 tasks of the official POLY-SIM 2026 test set and obtains competitive results under the more challenging missing-face settings (P4 and P6).
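
One way to read the three training terms named in this summary is as a single weighted objective. The form below is a sketch under stated assumptions: the branch set $\mathcal{B}$, the weights $\lambda_{\mathrm{kd}}$ and $\lambda_{\mathrm{c}}$, and the exact shape of each term are illustrative and are not given on this page.

```latex
% Assumed notation (not from the paper):
%   \mathcal{B}        : set of classification branches (e.g. audio, face, fused)
%   \lambda_{kd}, \lambda_{c} : hypothetical loss weights
%   x_i, y_i, c_{y_i}  : embedding, speaker label, and learned class center
\mathcal{L}
  = \sum_{b \in \mathcal{B}} \mathcal{L}_{\mathrm{cls}}^{(b)}
  + \lambda_{\mathrm{kd}} \, \mathcal{L}_{\mathrm{kd}}
  + \lambda_{\mathrm{c}} \, \mathcal{L}_{\mathrm{center}},
\qquad
\mathcal{L}_{\mathrm{center}}
  = \frac{1}{2} \sum_{i=1}^{N} \left\lVert x_i - c_{y_i} \right\rVert_2^2
```

Here $\mathcal{L}_{\mathrm{center}}$ is the standard center loss, which pulls each embedding toward the center of its speaker class. The distillation term $\mathcal{L}_{\mathrm{kd}}$ is left abstract because the page does not say how the audio-only branch is used in it.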

📝 Abstract

Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at https://github.com/MSA-LMC/MRAF.
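
The missing-token substitution and reliability-weighted fusion described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the linear reliability scorers `w_audio` and `w_face`, the mean pooling, the single-head attention, and the final concatenation are all assumptions made for illustration. In a real model the missing token and the scorers would be trained parameters; here they are plain arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Single-head scaled dot-product attention without projections
    # (assumption: a real module would add learned Q/K/V projections).
    d = query.shape[-1]
    scores = query @ context.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def fuse(audio_tokens, face_tokens, face_present, missing_token, w_audio, w_face):
    """Hypothetical reliability-aware fusion.

    audio_tokens: (B, Ta, D), face_tokens: (B, Tf, D)
    face_present: (B,) bool, missing_token: (D,)
    w_audio, w_face: (D,) assumed linear reliability scorers
    """
    # Replace unavailable face inputs with the missing token, not zeros.
    mask = face_present[:, None, None]
    face = np.where(mask, face_tokens, missing_token)

    # One reliability score per modality per sample, from mean-pooled tokens.
    s_audio = audio_tokens.mean(axis=1) @ w_audio
    s_face = face.mean(axis=1) @ w_face

    # Normalize the two scores into modality weights that sum to 1.
    weights = softmax(np.stack([s_audio, s_face], axis=-1), axis=-1)

    # Apply the weights to the tokens before bidirectional cross-attention.
    audio_w = audio_tokens * weights[:, 0, None, None]
    face_w = face * weights[:, 1, None, None]
    audio_out = cross_attention(audio_w, face_w)  # audio attends to face
    face_out = cross_attention(face_w, audio_w)   # face attends to audio

    # Pool each stream and concatenate into one embedding of size 2*D.
    fused = np.concatenate([audio_out.mean(axis=1), face_out.mean(axis=1)], axis=-1)
    return fused, weights
```

Because every sample passes through the same token space, a sample without a face goes through exactly the same scoring and attention path as a complete one. Whatever sits in the face slot of such a sample is ignored, since the missing token overwrites it before any computation.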

Problem

Research questions and friction points this paper is trying to address.

polyglot speaker identification

missing modality

cross-lingual generalization

reliability-aware fusion

multimodal robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

missing-token prompting

reliability-aware fusion

polyglot speaker identification