Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two weaknesses of existing black-box audio adversarial attacks: limited transferability and vulnerability to waveform-level defenses. The authors propose an attack that operates in the self-supervised learning (SSL) feature space: adversarial perturbations are crafted in the SSL-based acoustic-phonetic representation and then reconstructed into speech-like waveforms via a vocoder, which makes the samples harder for waveform-level defenses to catch and improves cross-model transferability. Optimized only on Whisper-small as a surrogate model, the attack transfers to multiple black-box automatic speech recognition (ASR) systems with a +26.6 Word Error Rate (WER) improvement over the state-of-the-art baseline, and it remains effective against multiple defenses with a +36.2 WER improvement.
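Since both headline numbers are WER differences, a minimal sketch of how WER is commonly computed may help in reading them. The function name `wer` and the whitespace tokenization are assumptions made for illustration; the summary does not say which text normalization or scoring tool the authors used. Under this convention, a gain such as +26.6 would most naturally read as a difference in percentage points, which is an interpretation rather than something the summary states.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length.

    Illustrative helper (not from the paper); tokens are split on whitespace.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


clean_wer = wer("the cat sat on the mat", "the cat sat on the mat")
attacked_wer = wer("the cat sat on the mat", "the cat sit on mat")
print(f"clean: {clean_wer:.1f}  attacked: {attacked_wer:.1f}")
```

For an attack, a higher WER on the victim model means a stronger attack, so an "improvement" here is an increase in the victim's error rate.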

📝 Abstract

Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.
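The pipeline described in the abstract (perturb features, reconstruct a waveform with a vocoder, score it with a surrogate model) can be illustrated with a toy sketch. Everything below is a stand-in chosen for illustration: the "vocoder" and "surrogate" are random linear maps rather than a neural vocoder and Whisper-small, and the function names, the cross-entropy loss, the signed-gradient update, and the L-infinity bound `eps` around the clean features are assumptions, since the abstract does not specify the paper's loss or constraint.

```python
import numpy as np


def surrogate_loss(wave, surrogate_w, label):
    """Cross-entropy of a toy linear classifier standing in for the surrogate ASR."""
    logits = surrogate_w @ wave
    logits = logits - logits.max()  # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]


def feature_vocoder_attack(feats, vocoder_w, surrogate_w, label,
                           eps=0.5, step=0.1, iters=20):
    """Projected signed-gradient ascent on a perturbation in feature space."""
    delta = np.zeros_like(feats)
    for _ in range(iters):
        wave = vocoder_w @ (feats + delta)  # toy "vocoder": features -> waveform
        logits = surrogate_w @ wave         # toy "surrogate": waveform -> logits
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad_logits = probs.copy()
        grad_logits[label] -= 1.0           # d(cross-entropy)/d(logits)
        # Chain rule back through both linear maps into feature space
        grad_feats = vocoder_w.T @ (surrogate_w.T @ grad_logits)
        # Ascend the loss, then project back into the eps-ball around the clean features
        delta = np.clip(delta + step * np.sign(grad_feats), -eps, eps)
    return vocoder_w @ (feats + delta), delta


rng = np.random.default_rng(0)
feats = rng.normal(size=8)                    # hypothetical clean SSL features
vocoder_w = rng.normal(size=(16, 8)) / 4.0    # hypothetical vocoder weights
surrogate_w = rng.normal(size=(4, 16)) / 4.0  # hypothetical surrogate weights

clean_wave = vocoder_w @ feats
label = int(np.argmax(surrogate_w @ clean_wave))  # the surrogate's clean prediction
adv_wave, delta = feature_vocoder_attack(feats, vocoder_w, surrogate_w, label)

print("clean loss:", surrogate_loss(clean_wave, surrogate_w, label))
print("adversarial loss:", surrogate_loss(adv_wave, surrogate_w, label))
```

The point of the sketch is structural: the optimization variable lives in feature space, and the waveform handed to the recognizer is always the vocoder's output rather than the clean audio plus additive noise. In a real setting the gradient would come from automatic differentiation through the actual vocoder and surrogate model.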

Problem

Research questions and friction points this paper is trying to address.

adversarial attacks

automatic speech recognition

black-box transferability

defense robustness

waveform perturbations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature-space adversarial attack

Self-supervised learning (SSL) representations

Vocoder-based reconstruction