🤖 AI Summary
Audio deepfake detection models built on foundation models such as Wav2Vec2 generalize poorly to forgery methods not represented in their fixed fine-tuning sets. To address this, the paper proposes Mixture-of-LoRA Experts (MoLoRA), which embeds multiple low-rank adapters into the attention layers and employs a dynamic routing mechanism to selectively activate task-specialized experts, enabling adaptive modeling of novel forgery patterns. The backbone parameters remain entirely frozen, keeping the approach computationally efficient and scalable. Experiments demonstrate that MoLoRA significantly outperforms standard fine-tuning in both in-domain and out-of-domain settings. Specifically, the best-performing model reduces the average out-of-domain equal error rate (EER) from 8.55% to 6.08%, substantially enhancing robustness against previously unseen attacks. This improvement underscores MoLoRA's effectiveness in mitigating domain shift and improving generalization to unseen spoofing methods in audio deepfake detection.
📝 Abstract
Foundation models such as Wav2Vec2 excel at representation learning in speech tasks, including audio deepfake detection. However, after being fine-tuned on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in training. To address this, we propose a mixture-of-LoRA-experts approach that integrates multiple low-rank adapters (LoRA) into the model's attention layers. A routing mechanism selectively activates specialized experts, enhancing adaptability to evolving deepfake attacks. Experimental results show that our method outperforms standard fine-tuning in both in-domain and out-of-domain scenarios, reducing equal error rates relative to baseline models. Notably, our best MoE-LoRA model lowers the average out-of-domain EER from 8.55% to 6.08%, demonstrating its effectiveness in achieving generalizable audio deepfake detection.
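The core mechanism can be sketched in code: a frozen linear projection (standing in for an attention-layer weight) is augmented with several LoRA experts, and a token-wise router mixes the top-k experts' low-rank updates. This is a minimal illustrative sketch, not the paper's implementation; the expert count, rank, scaling, and top-k softmax routing shown here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLoRALinear(nn.Module):
    """Frozen linear layer plus a mixture of LoRA experts (illustrative sketch).

    Hypothetical configuration: num_experts, rank, top_k, and alpha are
    placeholder values, not the paper's reported settings.
    """

    def __init__(self, in_dim, out_dim, num_experts=4, rank=8, top_k=2, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)  # backbone stays frozen
        self.base.bias.requires_grad_(False)
        # Per-expert low-rank factors: delta_W_e = B_e @ A_e, with rank r << min(in, out).
        # B is zero-initialized so each expert starts as an identity update.
        self.A = nn.Parameter(torch.randn(num_experts, rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, out_dim, rank))
        self.router = nn.Linear(in_dim, num_experts)  # token-wise gating scores
        self.top_k = top_k
        self.scale = alpha / rank

    def forward(self, x):  # x: (batch, seq, in_dim)
        gates = F.softmax(self.router(x), dim=-1)      # (B, S, E)
        topv, topi = gates.topk(self.top_k, dim=-1)    # keep only top-k experts
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize their weights
        # Low-rank update from every expert: (B, S, E, out_dim)
        h = torch.einsum("bsi,eri->bser", x, self.A)
        delta = torch.einsum("bser,eor->bseo", h, self.B)
        # Gather the selected experts' updates and mix by gate weight
        mix = torch.zeros_like(delta[..., 0, :])
        for k in range(self.top_k):
            idx = topi[..., k]  # (B, S) expert index per token
            sel = torch.gather(
                delta, 2, idx[..., None, None].expand(-1, -1, 1, delta.size(-1))
            ).squeeze(2)
            mix = mix + topv[..., k : k + 1] * sel
        return self.base(x) + self.scale * mix
```

Only the `A`, `B`, and router parameters are trainable, so the adapter adds a small fraction of the backbone's parameter count while the routing lets different experts specialize on different forgery patterns.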