SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification

📅 2025-06-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional audio-visual speaker verification relies heavily on large-scale labeled data and modality-specific architectures, resulting in high computational overhead and poor generalization. To address these limitations, we propose the first unified self-supervised framework for audio-visual speaker verification: it employs a shared visual Transformer backbone and integrates contrastive learning, asymmetric masking, and masked data modeling to jointly process audio, video, and audio-visual inputs—naturally accommodating missing-modality scenarios. Unlike conventional modality-isolated designs, our approach eliminates the need for labeled data while achieving performance on par with fully supervised methods. It significantly reduces both computational cost and data dependency. Experiments demonstrate strong robustness across multimodal inputs, alongside high efficiency, scalability, and cross-modal consistency. Our framework establishes a novel paradigm for low-resource speaker verification.

Technology Category

Application Category

📝 Abstract
Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which is computationally expensive, limiting their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audiovisual feature representations. In particular, we employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audiovisual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.
Problem

Research questions and friction points this paper is trying to address.

Reduces reliance on labeled data for speaker verification
Unifies audio-visual inputs with single shared backbone
Improves robustness to missing modalities efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised contrastive learning with asymmetric masking
Unified shared backbone for audio-visual inputs
Vision transformers for robust missing modality handling
🔎 Similar Papers
No similar papers found.