Generalizable Audio Spoofing Detection using Non-Semantic Representations

📅 2025-08-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To counter the proliferation of generative audio attacks and the poor generalization of existing deepfake detectors, this paper proposes a generalizable detection framework built on non-semantic universal audio representations. Rather than relying on semantic or handcrafted features, it uses self-supervised speech representation models, namely TRILL and TRILLsson, to extract robust non-semantic audio embeddings, which are then fed into a lightweight classifier for spoof detection. Experiments show in-domain performance comparable to state-of-the-art (SOTA) methods and significantly better performance in cross-domain and public benchmark settings, including ASVspoof 2019 Logical Access and In-the-Wild, especially under unknown synthesis algorithms and channel distortions. The core contribution is empirical evidence that non-semantic universal representations play a critical role in the robustness and generalizability of audio forgery detection.
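The pipeline described above (frame-level embeddings, pooled per clip, then a lightweight classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings are synthetic random vectors standing in for TRILL or TRILLsson outputs, the 16-dimensional size is a toy value, mean pooling is one possible aggregation, and logistic regression is a stand-in for whatever lightweight classifier the paper actually uses.

```python
import numpy as np

def mean_pool(frame_embeddings):
    """Average frame-level embeddings of shape (T, D) into one clip vector (D,)."""
    return np.asarray(frame_embeddings).mean(axis=0)

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression with plain batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted spoof probability
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def spoof_score(X, w, b):
    """Return the probability that each clip is spoofed (1 = spoof)."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic stand-in for a non-semantic embedding model (hypothetical data,
# NOT real TRILL output): spoofed clips are shifted away from bona fide ones.
rng = np.random.default_rng(0)
D = 16  # toy embedding dimension, chosen only for this example

def fake_clip(label):
    T = int(rng.integers(20, 40))          # variable number of frames per clip
    center = 0.5 if label == 1 else -0.5
    return rng.normal(center, 1.0, size=(T, D))

labels = np.array([0, 1] * 100)            # 0 = bona fide, 1 = spoof
X = np.stack([mean_pool(fake_clip(l)) for l in labels])

w, b = train_logistic(X, labels.astype(float))
pred = (spoof_score(X, w, b) >= 0.5).astype(int)
accuracy = (pred == labels).mean()
print(f"training accuracy on toy data: {accuracy:.2f}")
```

In a real setting, `fake_clip` would be replaced by a call to a pretrained embedding model, and accuracy would be measured on held-out data rather than on the training clips.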

📝 Abstract

Rapid advances in generative modeling have made synthetic audio easy to generate, leaving speech-based services vulnerable to spoofing attacks. Consequently, robust countermeasures are needed more than ever. Existing deepfake detection solutions are often criticized for lacking generalizability, and they fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection that leverages non-semantic universal audio representations. Extensive experiments were performed with the TRILL and TRILLsson models to find suitable non-semantic features. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.
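The abstract compares in-domain and out-of-domain performance without naming a metric in this summary. Equal error rate (EER) is a common choice in spoofing detection, so the sketch below assumes it. The function name, the score convention (higher means more likely spoofed), and the toy scores are all hypothetical and are not results from the paper.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    scores: higher means more likely spoofed (assumed convention).
    labels: 1 = spoof, 0 = bona fide.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        flagged = scores >= t
        false_alarm = np.mean(flagged[labels == 0])   # bona fide flagged as spoof
        miss = np.mean(~flagged[labels == 1])         # spoof left unflagged
        gap = abs(false_alarm - miss)
        if gap < best_gap:                            # closest point to the crossover
            best_gap, eer = gap, (false_alarm + miss) / 2
    return eer

# Toy scores only: one well-separated set and one overlapping set.
labels = [0, 0, 1, 1]
separated_eer = equal_error_rate([0.1, 0.2, 0.8, 0.9], labels)
overlapping_eer = equal_error_rate([0.1, 0.6, 0.4, 0.9], labels)
print(separated_eer, overlapping_eer)
```

A detector that generalizes well would keep the EER on out-of-domain data close to its in-domain value; a large gap between the two is the failure mode the abstract describes for existing methods.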

Problem

Research questions and friction points this paper is trying to address.

Detecting synthetic audio spoofing attacks on speech services

Improving generalizability of deepfake detection to real-world data

Leveraging non-semantic audio representations for robust countermeasures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-semantic audio representations for detection

TRILL and TRILLsson models used as feature extractors

Superior generalization on out-of-domain data
