🤖 AI Summary
Audio-language models (ALMs) face a critical security threat from stealthy, universal, and robust audio jailbreaks: imperceptible adversarial perturbations that exploit weaknesses in ALM multimodal alignment and elicit harmful outputs across prompts, tasks, base audio samples, and real-world acoustic conditions.
Method: We propose the first universal audio-modal jailbreaking framework, integrating adversarial perturbation synthesis, generalization across prompts, tasks, and base audio samples, realistic acoustic environment simulation, and analysis of how ALMs internally interpret the adversarial audio (a minimal optimization sketch follows this summary).
Contribution/Results: Our perturbations achieve high attack success rates, generalize across prompts, tasks, and base audio samples, and remain effective under simulated real-world acoustic conditions. Attribution analysis reveals that the most effective perturbations encode imperceptible first-person toxic speech, indicating that linguistic features embedded in the audio signal, rather than arbitrary noise, drive the alignment failure. The work moves beyond text-based jailbreaking paradigms, establishing a methodology for audio-modal jailbreaking research and opening a new direction for multimodal safety evaluation.
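The summary above does not spell out the optimization, but a common way to realize such a universal perturbation is projected gradient descent on a single delta shared across a batch of base clips (and, in a real attack, across prompts and tasks as well). The sketch below is a minimal illustration under that assumption; `surrogate_loss`, the random projection, and all hyperparameters are placeholders, not the authors' method.

```python
import torch

torch.manual_seed(0)

# Stand-in objective (NOT the paper's model): a fixed random projection that
# maps a waveform batch to a scalar. A real attack would instead minimize the
# victim ALM's cross-entropy on a target jailbreak completion, summed over
# many (prompt, task) pairs so the delta generalizes across those too.
_proj = torch.randn(16000)

def surrogate_loss(adv_batch: torch.Tensor) -> torch.Tensor:
    return (adv_batch @ _proj).pow(2).mean()

def universal_perturbation(base_clips: torch.Tensor, eps: float = 0.008,
                           steps: int = 300, lr: float = 1e-3) -> torch.Tensor:
    """PGD-style search for one delta shared by every base clip, so the
    optimum must generalize across samples instead of overfitting one input."""
    delta = torch.zeros(base_clips.shape[-1], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (base_clips + delta).clamp(-1.0, 1.0)  # keep a valid waveform
        loss = surrogate_loss(adv)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # project into the imperceptibility budget
    return delta.detach()

clips = 0.1 * torch.randn(8, 16000)   # eight 1-second base clips at 16 kHz
delta = universal_perturbation(clips)
print(float(delta.abs().max()))       # stays within the L-inf budget
```

Averaging the loss over the whole batch is what makes the delta universal: a perturbation that only fools the model on one clip is penalized on all the others.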
📝 Abstract
The rise of multimodal large language models has introduced innovative human-machine interaction paradigms but also significant challenges in machine learning safety. Audio-Language Models (ALMs) are especially relevant due to the intuitive nature of spoken communication, yet little is known about their failure modes. This paper explores audio jailbreaks targeting ALMs, focusing on their ability to bypass alignment mechanisms. We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples, demonstrating the first universal jailbreaks in the audio modality, and show that these remain effective in simulated real-world conditions. Beyond demonstrating attack feasibility, we analyze how ALMs interpret these audio adversarial examples and reveal them to encode imperceptible first-person toxic speech, suggesting that the most effective perturbations for eliciting toxic outputs specifically embed linguistic features within the audio signal. These results have important implications for understanding the interactions between different modalities in multimodal models, and offer actionable insights for enhancing defenses against adversarial audio attacks.
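Robustness in "simulated real-world conditions" is typically achieved by optimizing under random acoustic transformations (expectation over transformation): each gradient step sees the adversarial waveform convolved with a room impulse response (RIR) and mixed with noise. Below is a minimal differentiable simulator one could drop into the attack loop, with a synthetic decaying-noise RIR standing in for measured responses; it is a sketch, not the paper's simulation pipeline.

```python
import torch

def simulate_room(audio: torch.Tensor, rir: torch.Tensor,
                  snr_db: float = 20.0) -> torch.Tensor:
    """Convolve a waveform with a room impulse response and add noise at a
    given SNR. FFT convolution keeps the pipeline differentiable, so it can
    sit inside the attack's gradient loop (EOT-style robustness training)."""
    n = audio.shape[-1] + rir.shape[-1] - 1
    wet = torch.fft.irfft(torch.fft.rfft(audio, n) * torch.fft.rfft(rir, n), n)
    wet = wet[..., : audio.shape[-1]]  # trim back to the input length
    noise = torch.randn_like(wet)
    # Scale the noise so the signal-to-noise ratio equals snr_db.
    scale = torch.sqrt(wet.pow(2).mean() /
                       (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return wet + scale * noise

# Toy RIR: exponentially decaying noise (a placeholder for measured responses).
t = torch.arange(4000, dtype=torch.float32)
rir = torch.randn(4000) * torch.exp(-t / 800.0)
rir = rir / rir.abs().max()

degraded = simulate_room(0.1 * torch.randn(16000), rir, snr_db=15.0)
```

The attribution finding, that effective perturbations encode first-person toxic speech, also suggests a simple probe: render the optimized delta on its own and transcribe it. The snippet assumes the open-source `openai-whisper` package and a hypothetical `delta.wav`; the paper's actual analysis procedure is not specified here.

```python
import whisper  # assumption: the open-source `openai-whisper` package

model = whisper.load_model("base")
# "delta.wav" is hypothetical: the optimized perturbation written out alone.
result = model.transcribe("delta.wav")
print(result["text"])  # effective deltas may decode to intelligible speech
```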