RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the multilingual cross-modal face-voice association task, specifically tackling semantic alignment challenges in English-German scenarios. We propose a novel fusion framework comprising two key components: (1) an inter-modal feature reweighting mechanism that enhances semantically shared representations across face and voice modalities while bridging language gaps; and (2) an orthogonal projection constraint that suppresses modality-specific noise and improves semantic comparability between heterogeneous biometric features. Evaluated on the FAME 2026 Challenge English-German test set, our method achieves an Equal Error Rate (EER) of 33.1%, ranking third. The core contribution lies in the first integration of orthogonal projection with semantic-aware fusion, significantly improving consistency modeling of cross-modal features in multilingual settings. This approach advances robust multimodal representation learning under linguistic diversity.
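
To make the orthogonal projection constraint concrete, here is a minimal NumPy sketch. The summary does not give the exact loss, so this assumes one common form of orthogonality constraint: embeddings of the same identity are pulled toward cosine similarity 1, and embeddings of different identities toward cosine similarity 0. The function name `orthogonal_projection_loss` and the `features`/`labels` arguments are hypothetical, and this should not be read as the authors' implementation.

```python
import numpy as np

def orthogonal_projection_loss(features, labels):
    """Illustrative orthogonality constraint (assumed form, not the paper's code).

    features: (N, D) array of fused face-voice embeddings (hypothetical input).
    labels:   (N,) array of identity labels.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    # L2-normalise so that dot products are cosine similarities.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T

    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    pos = same & off_diag   # same identity, excluding self-pairs
    neg = ~same             # different identities

    # Mean similarity of same-identity pairs (target: 1).
    s = sim[pos].mean() if pos.any() else 1.0
    # Mean absolute similarity of different-identity pairs (target: 0).
    d = np.abs(sim[neg]).mean() if neg.any() else 0.0

    return (1.0 - s) + d
```

The loss reaches zero when same-identity embeddings point in the same direction and different-identity embeddings are mutually orthogonal, which is the sense in which such a constraint makes heterogeneous features more comparable.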

📝 Abstract
The Face-Voice Association in Multilingual Environments (FAME) 2026 challenge investigates the face-voice association task in multilingual scenarios. The challenge introduces English-German face-voice pairs for use in the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association by focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge, achieving an EER of 33.1%.
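
For readers unfamiliar with the metric, the following sketch shows how an Equal Error Rate can be computed from verification scores. It assumes that a higher score means a more likely face-voice match and that `labels` marks genuine pairs with 1 and impostor pairs with 0. The function name `equal_error_rate` is hypothetical, and the challenge's official scoring script may differ in details such as interpolation.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    scores: similarity scores, higher = more likely a match (assumption).
    labels: 1 for genuine face-voice pairs, 0 for impostor pairs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Candidate thresholds: every observed score, plus one above the
    # maximum so the "reject everything" operating point is included.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False acceptance rate: impostor pairs scored at or above the threshold.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: genuine pairs scored below the threshold.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

The function returns a fraction in [0, 1]; multiplying by 100 gives a percentage comparable to the reported 33.1. Lower is better, and random guessing on balanced pairs yields roughly 50%.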
Problem

Research questions and friction points this paper is trying to address.

Face-voice association in multilingual environments
Fusion and orthogonal projection methods revisited
Focusing on relevant semantic information across modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fusion and orthogonal projection for semantic alignment
Focus on relevant semantic information across modalities
Optimized for multilingual English-German face-voice pairs
Abdul Hannan
University of Trento, Italy
Furqan Malik
Saxion University of Applied Sciences, Netherlands
Hina Jabbar
University of Education, Lahore, Pakistan
Syed Suleman Sadiq
Mubashir Noman
MBZUAI
Image Processing, Object Tracking / Classification, Remote Sensing Change Detection, Computer