🤖 AI Summary
To detect deepfake speech, this paper proposes BiCrossMamba-ST, a dual-branch spectro-temporal architecture: one branch models spectral sub-bands, the other models temporal intervals, and the two are fused through bidirectional Mamba blocks with mutual cross-attention. A convolution-based 2D attention map focuses the model on specific spectro-temporal regions, and the model operates directly on raw features. On the ASVSpoof 2021 LA and DF benchmarks, the method achieves relative improvements of 67.74% and 26.3% over AASIST, respectively, and a 6.80% improvement over RawBMamba on the DF benchmark. The stated aim is more robust deepfake detection through explicit attention to spectro-temporal regions.
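The percentages quoted above are relative figures, not absolute differences. As a minimal sketch, assuming they are relative reductions in an error metric such as the equal error rate (the abstract itself says only "relative gain" and does not name the metric), the calculation looks like this. The function name and the input values are hypothetical placeholders, not results from the paper.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Placeholder values chosen for easy arithmetic; NOT figures from the paper.
example = relative_reduction(baseline_error=4.0, new_error=3.0)
print(example)  # 25.0
```

Under this reading, a 67.74% relative reduction means the new error is roughly one third of the baseline error, whatever the absolute values are.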
📝 Abstract
We propose BiCrossMamba-ST, a robust framework for speech deepfake detection that leverages a dual-branch spectro-temporal architecture powered by bidirectional Mamba blocks and mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST effectively captures the subtle cues of synthetic speech. In addition, the framework uses a convolution-based 2D attention map to focus on specific spectro-temporal regions, enabling robust deepfake detection. Operating directly on raw features, BiCrossMamba-ST achieves relative gains of 67.74% and 26.3% over the state-of-the-art AASIST on the ASVSpoof LA21 and ASVSpoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVSpoof DF21. Code and models will be made publicly available.
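To make the idea of mutual cross-attention between the two branches concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: it uses plain scaled dot-product attention with no learned projections, omits the bidirectional Mamba blocks and the 2D attention map entirely, and all function names, token counts, and feature sizes are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries: np.ndarray, context: np.ndarray):
    """Each row of `queries` attends over the rows of `context`.

    Assumption: queries and context share the same feature size, and
    no learned query/key/value projections are applied.
    """
    dim = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(dim)   # (n_queries, n_context)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ context, weights


def mutual_cross_attention(spectral: np.ndarray, temporal: np.ndarray):
    """Let each branch query the other, with a residual connection."""
    spec_update, w_spec = cross_attention(spectral, temporal)
    temp_update, w_temp = cross_attention(temporal, spectral)
    return spectral + spec_update, temporal + temp_update, w_spec, w_temp


rng = np.random.default_rng(0)
spectral = rng.standard_normal((6, 16))    # 6 hypothetical sub-band tokens
temporal = rng.standard_normal((10, 16))   # 10 hypothetical interval tokens

spec_fused, temp_fused, w_spec, w_temp = mutual_cross_attention(spectral, temporal)
print(spec_fused.shape, temp_fused.shape)  # (6, 16) (10, 16)
```

Each branch keeps its own sequence length while absorbing information from the other, which is what allows the two representations to be integrated afterwards.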