Beyond Identity: A Generalizable Approach for Deepfake Audio Detection

📅 2025-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deepfake audio detection models generalize poorly, largely because they implicitly learn speaker identity cues rather than the acoustic artifacts intrinsic to synthesis, a failure mode termed the *identity leakage problem*. This work is the first to systematically identify and address this issue, proposing an identity-agnostic deepfake detection framework. We introduce Artifact Detection Modules (ADMs) together with novel artifact augmentation strategies (dynamic frequency-domain swapping, time-domain perturbation, and background noise injection) to isolate and model artifacts across both the time and frequency domains. Evaluated on the ADD 2022, FoR, and In-The-Wild benchmarks, our method achieves F1 scores of 0.230, 0.604, and 0.813, respectively, substantially outperforming prior baselines. Ablation studies confirm that dynamic frequency-domain swapping yields the most robust improvement in generalization.

📝 Abstract
Deepfake audio presents a growing threat to digital security, due to its potential for social engineering, fraud, and identity misuse. However, existing detection models suffer from poor generalization across datasets, due to implicit identity leakage, where models inadvertently learn speaker-specific features instead of manipulation artifacts. To the best of our knowledge, this is the first study to explicitly analyze and address identity leakage in the audio deepfake detection domain. This work proposes an identity-independent audio deepfake detection framework that mitigates identity leakage by encouraging the model to focus on forgery-specific artifacts instead of overfitting to speaker traits. Our approach leverages Artifact Detection Modules (ADMs) to isolate synthetic artifacts in both time and frequency domains, enhancing cross-dataset generalization. We introduce novel dynamic artifact generation techniques, including frequency domain swaps, time domain manipulations, and background noise augmentation, to enforce learning of dataset-invariant features. Extensive experiments conducted on ASVspoof2019, ADD 2022, FoR, and In-The-Wild datasets demonstrate that the proposed ADM-enhanced models achieve F1 scores of 0.230 (ADD 2022), 0.604 (FoR), and 0.813 (In-The-Wild), consistently outperforming the baseline. Dynamic Frequency Swap proves to be the most effective strategy across diverse conditions. These findings emphasize the value of artifact-based learning in mitigating implicit identity leakage for more generalizable audio deepfake detection.
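The abstract's most effective strategy, dynamic frequency-domain swapping, can be illustrated with a minimal numpy sketch. Note this is a hypothetical re-implementation: the function name `dynamic_freq_swap`, the band width `band_frac`, and the random-band sampling policy are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def dynamic_freq_swap(spec_a, spec_b, band_frac=0.2, rng=None):
    """Swap a randomly positioned band of frequency bins between two
    magnitude spectrograms of shape (freq_bins, frames).

    Illustrative sketch only; band width and sampling are assumed,
    not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_bins = spec_a.shape[0]
    band = max(1, int(band_frac * n_bins))       # width of the swapped band
    start = rng.integers(0, n_bins - band + 1)   # random band position
    out_a, out_b = spec_a.copy(), spec_b.copy()  # leave inputs untouched
    out_a[start:start + band] = spec_b[start:start + band]
    out_b[start:start + band] = spec_a[start:start + band]
    return out_a, out_b
```

The intuition is that swapping spectral regions across utterances breaks speaker-specific spectral signatures while preserving local synthesis artifacts, pushing the model toward artifact-based rather than identity-based cues.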
Problem

Research questions and friction points this paper is trying to address.

Addressing poor generalization in deepfake audio detection models
Mitigating identity leakage by focusing on forgery-specific artifacts
Enhancing cross-dataset performance with artifact detection modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identity-independent framework mitigates speaker-specific overfitting
Artifact Detection Modules isolate synthetic time-frequency artifacts
Dynamic artifact generation enforces dataset-invariant feature learning
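The background-noise augmentation named above can be sketched as mixing noise into a waveform at a target signal-to-noise ratio. This is a generic SNR-based mixer, not the paper's exact procedure; the function name, noise-cropping policy, and SNR parameterization are assumptions.

```python
import numpy as np

def add_noise_snr(wave, noise, snr_db, rng=None):
    """Mix background noise into a waveform at a target SNR in dB.

    Illustrative sketch; the paper's noise sources and SNR range are
    not specified here and are assumed.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Tile the noise if it is shorter than the signal, then crop a
    # random segment of matching length.
    if len(noise) < len(wave):
        noise = np.tile(noise, len(wave) // len(noise) + 1)
    start = rng.integers(0, len(noise) - len(wave) + 1)
    noise = noise[start:start + len(wave)]
    sig_pow = np.mean(wave ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12      # avoid division by zero
    # Scale the noise so that sig_pow / (scale^2 * noise_pow) hits the target SNR.
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wave + scale * noise
```

Randomizing the noise segment and SNR per training example is one common way to enforce the dataset-invariant feature learning the bullet describes, since recording-channel cues stop being a reliable shortcut.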
Yasaman Ahmadiadli
Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada.
Xiao-Ping Zhang
Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada.
Naimul Khan
Associate Professor, Toronto Metropolitan University (Ryerson University)
Signal Processing, Medical Imaging, Machine Learning, Augmented/Virtual Reality