BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of deepfake speech detection, this paper proposes a dual-branch spectro-temporal architecture: one branch models spectral sub-bands, the other models temporal intervals, and the two are fused through mutual cross-attention built on bidirectional Mamba blocks. A convolution-based 2D attention map directs the model toward specific spectro-temporal regions, and the model operates directly on raw features. On the ASVSpoof2021 LA and DF benchmarks, the method achieves relative gains of 67.74% and 26.3% over AASIST, respectively, and improves on RawBMamba by 6.80% on DF21. The result is a more robust detector whose attention is explicitly focused on spectro-temporal regions.
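The relative percentages quoted above can be read with a small piece of arithmetic. The sketch below assumes they are relative reductions in an error metric (for example, equal error rate); this page does not list the underlying absolute values, so the function name `relative_reduction` and the numbers in the example are purely illustrative and are not the paper's figures.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction of new_error versus baseline_error, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates chosen only to illustrate the arithmetic;
# they are NOT the values reported in the paper.
print(relative_reduction(2.0, 1.5))  # 25.0
```

A relative figure of this kind says how much of the baseline's error was removed, not how many absolute percentage points were gained.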

📝 Abstract
We propose BiCrossMamba-ST, a robust framework for speech deepfake detection built on a dual-branch spectro-temporal architecture that combines bidirectional Mamba blocks with mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST captures the subtle cues of synthetic speech. The framework also uses a convolution-based 2D attention map to focus on specific spectro-temporal regions. Operating directly on raw features, BiCrossMamba-ST achieves relative gains of 67.74% and 26.3% over the state-of-the-art AASIST on the ASVSpoof LA21 and ASVSpoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVSpoof DF21. Code and models will be made publicly available.
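The idea of letting two branches attend to each other can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: it assumes standard scaled dot-product attention with no learned projections, reuses the same vectors as keys and values, and the names `cross_attend` and `mutual_cross_attention` as well as the toy token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values.

    Assumption: no learned projections; keys and values are the same vectors.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def mutual_cross_attention(spectral, temporal):
    """Each branch queries the other, giving one updated sequence per branch."""
    spectral_out = cross_attend(spectral, temporal)   # spectral tokens read temporal tokens
    temporal_out = cross_attend(temporal, spectral)   # temporal tokens read spectral tokens
    return spectral_out, temporal_out


# Toy inputs (hypothetical): 3 sub-band tokens and 2 time-interval tokens, dimension 2.
spectral = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
temporal = [[0.5, 0.5], [1.0, -1.0]]
s_out, t_out = mutual_cross_attention(spectral, temporal)
print(len(s_out), len(t_out))  # 3 2
```

Each output sequence keeps the length of its own branch, while every output vector is a weighted mix of the other branch's tokens.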
Problem

Research questions and friction points this paper is trying to address.

Detecting synthetic speech by jointly modeling spectral and temporal cues with cross-attention
Improving deepfake detection with bidirectional Mamba blocks
Raising detection performance on the ASVSpoof LA21 and DF21 benchmarks
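To make "bidirectional" concrete, here is a deliberately simplified sketch: a fixed linear recurrence run once left to right and once right to left, with the two passes summed. A real Mamba block uses input-dependent (selective) state-space parameters, and how the paper combines the two directions is not stated on this page, so the recurrence, the `decay` parameter, and the summation are all assumptions made for illustration.

```python
def scan(xs, decay=0.5):
    """Linear recurrence h_t = decay * h_{t-1} + x_t over a list of floats."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_scan(xs, decay=0.5):
    """Run the scan forward and backward, then sum the two passes (assumed merge rule)."""
    forward = scan(xs, decay)
    backward = scan(xs[::-1], decay)[::-1]  # scan the reversed input, then restore order
    return [f + b for f, b in zip(forward, backward)]


# An impulse in the middle spreads to both neighbours, not only to later steps.
print(bidirectional_scan([0.0, 1.0, 0.0]))  # [0.5, 2.0, 0.5]
```

A forward-only scan would give `[0.0, 1.0, 0.5]` for the same input, so the first position would carry no information about the impulse that follows it.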
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional Mamba blocks power both branches of the dual-branch architecture
Spectro-temporal cross-attention integrates dual-branch representations
Convolution-based 2D attention maps focus on key spectro-temporal regions
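A convolution-based 2D attention map can be sketched as follows: a small kernel is slid over a frequency-by-time feature grid, the response is squashed to the range (0, 1), and the grid is reweighted by it. This is only an illustration under stated assumptions: the sigmoid gating, the fixed averaging kernel (a trained model would learn its kernel), the odd kernel size, and the helper names `conv2d_same`, `attention_map`, and `apply_attention` are not taken from the paper.

```python
import math


def conv2d_same(feat, kernel):
    """2D cross-correlation with zero padding; output has the same shape as feat.

    Assumes an odd-sized kernel so it can be centred on each cell.
    """
    rows, cols = len(feat), len(feat[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < rows and 0 <= c < cols:  # cells outside the grid count as zero
                        acc += kernel[a][b] * feat[r][c]
            out[i][j] = acc
    return out


def attention_map(feat, kernel):
    """Sigmoid of the convolution response: one weight in (0, 1) per cell."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row]
            for row in conv2d_same(feat, kernel)]


def apply_attention(feat, kernel):
    """Reweight every frequency-time cell by its attention weight."""
    amap = attention_map(feat, kernel)
    return [[f * a for f, a in zip(frow, arow)]
            for frow, arow in zip(feat, amap)]


# Hypothetical 3x4 grid (rows = frequency bands, columns = time frames) with one active cell.
feat = [[0.0, 0.0, 0.0, 0.0],
        [0.0, 9.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]  # fixed 3x3 averaging kernel, for illustration
amap = attention_map(feat, kernel)
print(len(amap), len(amap[0]))  # 3 4
```

Cells near the active region receive weights above 0.5, while cells whose neighbourhood is entirely zero stay at exactly 0.5.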