Unsupervised Domain Adaptation for Audio Deepfake Detection with Modular Statistical Transformations

📅 2026-03-09
🤖 AI Summary
This work addresses the significant performance degradation of audio deepfake detection models under cross-dataset deployment due to distributional shift. To enhance cross-domain generalization without access to target-domain labels, the authors propose an unsupervised domain adaptation framework that integrates pretrained Wav2Vec 2.0 embeddings with interpretable statistical transformations. The pipeline sequentially applies power transformation normalization, ANOVA-based feature selection, joint PCA for dimensionality reduction, and CORAL for covariance alignment, followed by logistic regression classification. Evaluated on bidirectional transfer tasks between ASVspoof and Fake-or-Real datasets, the method achieves accuracies of 62.7%–63.6%, representing a 10.7% improvement over the baseline. These results demonstrate the effectiveness and interpretability of the proposed modular architecture in mitigating domain shift in audio deepfake detection.
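The covariance-alignment step (CORAL) is the distinctive adaptation component of the pipeline: it whitens the source features using the source covariance and re-colors them with the target covariance. A minimal NumPy sketch of this idea follows; the function name and the `eps` ridge term are illustrative choices, not details from the paper.

```python
import numpy as np

def coral_align(source, target, eps=1e-5):
    """Align source features to the target covariance (CORAL).

    Whitens the source features with the inverse square root of the
    source covariance, then re-colors them with the square root of the
    target covariance. `eps` adds a small ridge so both matrix square
    roots are well-defined.
    """
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])

    # Inverse matrix square root of the source covariance (whitening)
    vals_s, vecs_s = np.linalg.eigh(cov_s)
    whiten = vecs_s @ np.diag(vals_s ** -0.5) @ vecs_s.T

    # Matrix square root of the target covariance (re-coloring)
    vals_t, vecs_t = np.linalg.eigh(cov_t)
    color = vecs_t @ np.diag(vals_t ** 0.5) @ vecs_t.T

    return source @ whiten @ color
```

After this transformation, the sample covariance of the aligned source features matches that of the target (up to the `eps` ridge), so a classifier trained on the aligned source data sees second-order statistics consistent with the target domain.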

📝 Abstract
Audio deepfake detection systems trained on one dataset often fail when deployed on data from different sources due to distributional shifts in recording conditions, synthesis methods, and acoustic environments. We present a modular pipeline for unsupervised domain adaptation that combines pre-trained Wav2Vec 2.0 embeddings with statistical transformations to improve cross-domain generalization without requiring labeled target data. Our approach applies power transformation for feature normalization, ANOVA-based feature selection, joint PCA for domain-agnostic dimensionality reduction, and CORAL alignment to match source and target covariance structures before classification via logistic regression. We evaluate on two cross-domain transfer scenarios: ASVspoof 2019 LA to Fake-or-Real (FoR) and FoR to ASVspoof, achieving 62.7–63.6% accuracy with balanced performance across real and fake classes. Systematic ablation experiments reveal that feature selection (+3.5%) and CORAL alignment (+3.2%) provide the largest individual contributions, with the complete pipeline improving accuracy by 10.7% over baseline. While performance is modest compared to within-domain detection (94–96%), our pipeline offers transparency and modularity, making it suitable for deployment scenarios requiring interpretable decisions.
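The full pipeline in the abstract can be sketched with scikit-learn and NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the Yeo-Johnson variant of the power transform, and hyperparameters such as `k_best=256` and `n_components=64` are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PowerTransformer


def _mat_power(cov, p):
    """Symmetric matrix power via eigendecomposition."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(vals ** p) @ vecs.T


def fit_adapted_classifier(X_src, y_src, X_tgt,
                           k_best=256, n_components=64, eps=1e-5):
    """Power transform -> ANOVA selection -> joint PCA -> CORAL -> logistic regression."""
    # 1) Power transformation (Yeo-Johnson) for feature normalization
    pt = PowerTransformer().fit(X_src)
    Xs, Xt = pt.transform(X_src), pt.transform(X_tgt)

    # 2) ANOVA F-test feature selection (uses source labels only)
    sel = SelectKBest(f_classif, k=min(k_best, Xs.shape[1])).fit(Xs, y_src)
    Xs, Xt = sel.transform(Xs), sel.transform(Xt)

    # 3) Joint PCA fitted on pooled source + target features
    pca = PCA(n_components=n_components).fit(np.vstack([Xs, Xt]))
    Xs, Xt = pca.transform(Xs), pca.transform(Xt)

    # 4) CORAL: whiten source covariance, re-color with target covariance
    cov_s = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    cov_t = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    Xs = Xs @ _mat_power(cov_s, -0.5) @ _mat_power(cov_t, 0.5)

    # 5) Logistic regression trained on the CORAL-aligned source features
    clf = LogisticRegression(max_iter=1000).fit(Xs, y_src)

    # Target data is classified after the same pt -> sel -> pca steps
    # (no CORAL needed: the source was aligned toward the target)
    def predict_target(X):
        return clf.predict(pca.transform(sel.transform(pt.transform(X))))

    return clf, predict_target
```

Note that only the ANOVA selection and the final classifier use source labels; the power transform, joint PCA, and CORAL steps are unsupervised with respect to the target domain, which is what makes the adaptation label-free.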
Problem

Research questions and friction points this paper is trying to address.

Audio Deepfake Detection
Unsupervised Domain Adaptation
Distributional Shift
Cross-domain Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised Domain Adaptation
Audio Deepfake Detection
Statistical Transformations
CORAL Alignment
Modular Pipeline