Vision Graph Non-Contrastive Learning for Audio Deepfake Detection with Limited Labels

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address audio deepfake detection under scarce labeled data, this paper proposes a novel paradigm integrating spectrogram-based spatiotemporal graph modeling with unsupervised graph non-contrastive learning (Graph NCL). Spectrograms are partitioned into spatiotemporal image patches to construct node-relational graphs; a visual graph convolutional encoder is designed and pre-trained via Graph NCL in a fully unsupervised manner, followed by fine-tuning of a lightweight detection head using minimal labeled samples. This work introduces Graph NCL to audio deepfake detection for the first time, substantially reducing reliance on manual annotations. Experiments demonstrate that the method achieves state-of-the-art equal error rate (EER) using only 5% labeled data. It attains the lowest EER across cross-domain evaluations on ASVspoof2019, ASVspoof2021, and the challenging In-The-Wild benchmark, validating both effectiveness and strong generalization capability.

📝 Abstract
Recent advancements in audio deepfake detection have leveraged graph neural networks (GNNs) to model frequency and temporal interdependencies in audio data, effectively identifying deepfake artifacts. However, the reliance of GNN-based methods on substantial labeled data for graph construction and robust performance limits their applicability in scenarios with limited labeled data. Although vast amounts of audio data exist, the process of labeling samples as genuine or fake remains labor-intensive and costly. To address this challenge, we propose SIGNL (Spatio-temporal vIsion Graph Non-contrastive Learning), a novel framework that maintains high GNN performance in low-label settings. SIGNL constructs spatio-temporal graphs by representing patches from the audio's visual spectrogram as nodes. These graph structures are modeled using vision graph convolutional (GC) encoders pre-trained through graph non-contrastive learning, a label-free approach that maximizes the similarity between positive pairs. The pre-trained encoders are then fine-tuned for audio deepfake detection, reducing reliance on labeled data. Experiments demonstrate that SIGNL outperforms state-of-the-art baselines across multiple audio deepfake detection datasets, achieving the lowest Equal Error Rate (EER) with as little as 5% labeled data. Additionally, SIGNL exhibits strong cross-domain generalization, achieving the lowest EER in evaluations involving diverse attack types and languages in the In-The-Wild dataset.
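The abstract's pipeline can be illustrated with a minimal sketch: a spectrogram is partitioned into patches that become graph nodes, nodes are linked by feature similarity, and a BYOL-style non-contrastive loss pulls together the embeddings of positive pairs (two views of the same node) without using any negatives or labels. All function names and the k-NN graph construction below are illustrative assumptions, not the paper's actual SIGNL implementation.

```python
import numpy as np

def spectrogram_to_patch_nodes(spec, patch_h, patch_w):
    """Partition a (freq, time) spectrogram into non-overlapping patches;
    each flattened patch becomes one graph-node feature vector.
    (Illustrative stand-in for SIGNL's patch extraction.)"""
    F, T = spec.shape
    F2, T2 = F - F % patch_h, T - T % patch_w  # drop ragged edges
    spec = spec[:F2, :T2]
    return (spec.reshape(F2 // patch_h, patch_h, T2 // patch_w, patch_w)
                .transpose(0, 2, 1, 3)
                .reshape(-1, patch_h * patch_w))

def knn_adjacency(nodes, k=3):
    """Connect each node to its k nearest neighbours in feature space
    (one common way to build a vision graph; assumed, not from the paper)."""
    d = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # no self-loops
    adj = np.zeros_like(d, dtype=bool)
    idx = np.argsort(d, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(nodes)), k)
    adj[rows, idx.ravel()] = True
    return adj | adj.T                   # symmetrise

def non_contrastive_loss(z_online, z_target):
    """Non-contrastive objective: maximise cosine similarity between the
    two views' embeddings of the same node (positive pairs only)."""
    zo = z_online / np.linalg.norm(z_online, axis=1, keepdims=True)
    zt = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(zo * zt, axis=1)))
```

Because the loss uses only positive pairs, pre-training needs no genuine/fake labels; identical embeddings for the two views yield a loss of zero, which is the objective's minimum.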
Problem

Research questions and friction points this paper is trying to address.

Audio Deepfake Detection
Limited Labeled Data
Reduced Dependency on Manual Annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SIGNL
Non-contrastive Learning
Audio Deepfake Detection