🤖 AI Summary
This work addresses the multilingual cross-modal face-voice association task, specifically tackling semantic alignment challenges in English-German scenarios. We propose a novel fusion framework comprising two key components: (1) an inter-modal feature reweighting mechanism that enhances semantically shared representations across face and voice modalities while bridging language gaps; and (2) an orthogonal projection constraint that suppresses modality-specific noise and improves semantic comparability between heterogeneous biometric features. Evaluated on the FAME 2026 Challenge English-German test set, our method achieves an Equal Error Rate (EER) of 33.1%, ranking third. The core contribution lies in the first integration of orthogonal projection with semantic-aware fusion, significantly improving consistency modeling of cross-modal features in multilingual settings. This approach advances robust multimodal representation learning under linguistic diversity.
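The two components above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the gated cross-modal reweighting and the centre-orthogonality penalty (names, shapes, and the sigmoid gating design are assumptions) only show the general shape of the idea.

```python
import numpy as np

def reweight_and_fuse(face, voice):
    """Inter-modal feature reweighting (hypothetical sketch): each
    modality's features are rescaled by gates derived from the other
    modality, then concatenated into one fused embedding."""
    gate_f = 1.0 / (1.0 + np.exp(-voice))  # voice -> weights for face
    gate_v = 1.0 / (1.0 + np.exp(-face))   # face  -> weights for voice
    fused = np.concatenate([face * gate_f, voice * gate_v], axis=-1)
    # L2-normalise so embeddings are comparable via cosine distance
    return fused / np.linalg.norm(fused, axis=-1, keepdims=True)

def orthogonality_penalty(centers):
    """Orthogonal projection constraint (sketch): penalise non-zero
    cosine similarity between identity centres so they occupy
    near-orthogonal directions in the fused space."""
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    gram = c @ c.T                         # pairwise cosine similarities
    off_diag = gram - np.eye(len(centers)) # zero out the diagonal
    return np.sum(off_diag ** 2) / (len(centers) * (len(centers) - 1))
```

In a training loop, the penalty would be added to a standard identification or verification loss; perfectly orthogonal centres give a penalty of zero, which is the regime the constraint pushes towards.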
📝 Abstract
The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge investigates the face-voice association task in multilingual scenarios, introducing English-German face-voice pairs for the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association, effectively focusing on the semantic information shared between the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 Challenge, achieving an EER of 33.1%.