Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing closed-set source-tracing models: they cannot reject audio generated by unknown speech synthesizers and often produce overconfident misattributions for unseen sources. The authors propose a dual-branch gated fusion framework that combines self-supervised speech representations (XLSR-53) with CORES, a 66-dimensional handcrafted acoustic descriptor. An input-conditioned gate adaptively weights the two branches, countering the representational imbalance that makes naive concatenation fail, and the model is trained jointly with cross-entropy, an energy margin loss for separating known from unknown sources, and a gate diversity regularizer. Evaluated on the MLAAD benchmark, the method achieves a known-source accuracy of 97.6% and an EERc of 4.9%, and reduces FPR95 by 83.5% relative to the Interspeech 2025 baseline.
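The input-conditioned gating described above can be illustrated with a minimal sketch. It assumes both branch embeddings have already been projected to a common dimension and that the gate is a softmax over two logits computed from the concatenated embeddings; the summary does not specify the gate architecture, so `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(ssl_emb, cores_emb, gate_weights, gate_bias):
    """Fuse two same-sized embeddings with an input-conditioned gate.

    gate_weights holds two rows (one per branch), each of length 2 * d.
    gate_bias holds two floats. All parameters are hypothetical.
    """
    if len(ssl_emb) != len(cores_emb):
        raise ValueError("branch embeddings must share a dimension")
    joint = list(ssl_emb) + list(cores_emb)
    # One logit per branch, computed from the utterance's own features,
    # so the branch weighting changes from input to input.
    logits = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(gate_weights, gate_bias)
    ]
    g_ssl, g_cores = softmax(logits)
    fused = [g_ssl * s + g_cores * c for s, c in zip(ssl_emb, cores_emb)]
    return fused, (g_ssl, g_cores)


# Toy inputs: with all-zero gate parameters both branches get equal weight.
ssl_emb = [1.0, 0.0]
cores_emb = [0.0, 1.0]
gate_weights = [[0.0] * 4, [0.0] * 4]
gate_bias = [0.0, 0.0]
fused, gates = gated_fusion(ssl_emb, cores_emb, gate_weights, gate_bias)
```

Because the gate weights sum to one, the fused vector is a convex combination of the two branches, which is one simple way to keep a high-dimensional SSL embedding from swamping a compact handcrafted descriptor.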

📝 Abstract

Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.
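To make the energy margin idea concrete, the sketch below uses the free-energy score commonly applied in energy-based OOD detection, together with a squared-hinge margin on ID and OOD samples. The abstract does not give the exact loss, so this form, the margins `m_in` and `m_out`, and the temperature are assumptions for illustration only.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T); lower values look more ID-like."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return -temperature * log_sum_exp


def energy_margin_loss(id_logits, ood_logits, m_in, m_out, temperature=1.0):
    """Squared-hinge penalty (assumed form, not taken from the paper).

    ID samples are penalised when their energy rises above m_in;
    OOD samples are penalised when their energy falls below m_out.
    """
    id_terms = [
        max(0.0, energy_score(z, temperature) - m_in) ** 2 for z in id_logits
    ]
    ood_terms = [
        max(0.0, m_out - energy_score(z, temperature)) ** 2 for z in ood_logits
    ]
    id_loss = sum(id_terms) / len(id_terms) if id_terms else 0.0
    ood_loss = sum(ood_terms) / len(ood_terms) if ood_terms else 0.0
    return id_loss + ood_loss


# Toy logits: a confident prediction versus a flat, uncertain one.
confident = [10.0, 0.0]
flat = [0.0, 0.0]
well_separated = energy_margin_loss([confident], [flat], m_in=-5.0, m_out=-1.0)
badly_separated = energy_margin_loss([flat], [confident], m_in=-5.0, m_out=-1.0)
```

At test time, under the same assumption, an utterance whose energy exceeds a chosen threshold would be rejected as coming from an unseen synthesizer instead of being assigned to the nearest known source.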

Problem

Research questions and friction points this paper is trying to address.

open-set

audio deepfake

source tracing

synthesizer attribution

out-of-distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-branch gated fusion

open-set audio deepfake tracing

CORES descriptor