Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks systematic, quantitative security evaluations of jailbreak attacks against large audio-language models (LAMs). Method: We introduce AJailBench, the first jailbreak evaluation benchmark tailored for LAMs, comprising 1,495 adversarial audio prompts spanning 10 policy-violating categories. We further propose the Audio Perturbation Toolkit (APT), which jointly leverages semantic consistency constraints and Bayesian optimization to generate stealthy, effective, and semantics-preserving dynamic adversarial examples across the time, frequency, and amplitude domains. Contribution/Results: Experiments reveal severe vulnerabilities across mainstream LAMs under AJailBench-APT: minute perturbations substantially degrade safety performance. This work establishes a standardized benchmark, a high-quality adversarial dataset, and a reproducible methodology to advance robustness assessment and defense research for LAMs.

📝 Abstract
The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety, especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text-to-speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across the time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantics-preserving perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
Problem

Research questions and friction points this paper is trying to address.

Evaluating jailbreak vulnerabilities in Large Audio-Language Models (LAMs)
Generating dynamic adversarial variants for realistic attack simulations
Assessing safety performance of LAMs under subtle semantic-preserving perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

AJailBench benchmark for LAM jailbreak evaluation
Dynamic adversarial variants with Audio Perturbation Toolkit
Bayesian optimization for subtle, effective audio perturbations
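The APT workflow described above (perturbations in the time, frequency, and amplitude domains, filtered by a semantic consistency constraint and tuned by an optimizer) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names are invented, a seeded random search stands in for Bayesian optimization, and `attack_score` / `semantic_sim` are placeholders for the benchmark's actual LAM-response scoring and transcript-similarity check.

```python
import numpy as np

def time_stretch(x: np.ndarray, rate: float) -> np.ndarray:
    """Naive time stretch via linear interpolation (stand-in for a phase vocoder)."""
    n = int(len(x) / rate)
    return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)

def freq_shift(x: np.ndarray, shift_hz: float, sr: int) -> np.ndarray:
    """Crude frequency shift by rolling FFT bins (illustrative only)."""
    X = np.fft.rfft(x)
    bins = int(shift_hz * len(x) / sr)
    X = np.roll(X, bins)
    if bins > 0:
        X[:bins] = 0          # zero wrapped-around low bins
    elif bins < 0:
        X[bins:] = 0          # zero wrapped-around high bins
    return np.fft.irfft(X, n=len(x))

def amp_scale(x: np.ndarray, gain: float) -> np.ndarray:
    """Amplitude perturbation, clipped to the valid waveform range."""
    return np.clip(x * gain, -1.0, 1.0)

def perturb(x, sr, rate, gain, shift_hz):
    """Compose the three perturbation domains: time, frequency, amplitude."""
    return amp_scale(freq_shift(time_stretch(x, rate), shift_hz, sr), gain)

def search_perturbation(x, sr, attack_score, semantic_sim,
                        sim_floor=0.9, trials=50, seed=0):
    """Random search (stand-in for Bayesian optimization): maximize the attack
    score subject to a semantic-consistency floor on the perturbed audio."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(trials):
        rate = rng.uniform(0.9, 1.1)
        gain = rng.uniform(0.8, 1.2)
        shift = rng.uniform(-100.0, 100.0)
        y = perturb(x, sr, rate, gain, shift)
        if semantic_sim(x, y) < sim_floor:   # discard perturbations that break intent
            continue
        s = attack_score(y)
        if s > best_score:
            best, best_score = y, s
    return best, best_score
```

In the benchmark setting, `semantic_sim` would compare ASR transcripts of the original and perturbed audio, and `attack_score` would measure how often the target LAM produces a policy-violating response.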
👥 Authors
Zirui Song, MBZUAI (NLP)
Qian Jiang, Northeastern University
Mingxuan Cui, MBZUAI
Mingzhe Li, ByteDance
Lang Gao, MBZUAI (Mechanistic Interpretability, Natural Language Processing)
Zeyu Zhang, Australian National University
Zixiang Xu, MBZUAI
Yanbo Wang, MBZUAI
Chenxi Wang, MBZUAI
Guangxian Ouyang, Northeastern University (Embodied AI)
Zhenhao Chen, MBZUAI (Causality, Machine Learning, Representation Learning, LLM, Multimodal AI)
Xiuying Chen, MBZUAI (Trustworthy NLP, Human-Centered NLP, Computational Social Science)