Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

📅 2025-11-12



🤖 AI Summary

Existing Audio Reasoning Models (ARMs) trained with standard Reasoning Training (RT) are highly vulnerable to advanced audio jailbreak attacks and frequently generate harmful outputs. This work is the first to systematically expose this security vulnerability and proposes Rebellion, a robust reasoning training framework. Rebellion jointly constructs safety-oriented adversarial audio examples and improves representation-space stability by explicitly modeling worst-case adversarial representation drift. Crucially, it improves attack resilience without compromising performance on benign tasks. Extensive experiments on Qwen2-Audio show that Rebellion significantly outperforms RT: it markedly improves defense success rates against diverse state-of-the-art audio jailbreak attacks while preserving over 98% of the original task accuracy. Rebellion therefore achieves a better safety–accuracy trade-off for audio reasoning models.


📝 Abstract

Instilling reasoning capabilities in large models (LMs) using reasoning training (RT) significantly improves LMs' performance. Thus, Audio Reasoning Models (ARMs), i.e., audio LMs that can reason, are becoming increasingly popular. However, no work has studied the safety of ARMs against jailbreak attacks, which aim to elicit harmful responses from target models. To this end, we first show that standard RT with appropriate safety reasoning data can protect ARMs from vanilla audio jailbreaks, but cannot protect them against our proposed simple yet effective jailbreaks. We show that this is because of the significant representation drift between vanilla and advanced jailbreaks, which forces the target ARMs to emit harmful responses. Based on this observation, we propose Rebellion, a robust RT that trains ARMs to be robust to the worst-case representation drift. All our results are on Qwen2-Audio; they demonstrate that Rebellion: 1) can protect against advanced audio jailbreaks without compromising performance on benign tasks, and 2) significantly improves the accuracy-safety trade-off over the standard RT method.
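The abstract attributes the failure of standard RT to representation drift between vanilla and advanced jailbreaks, but does not say here how that drift is measured. One simple way to quantify it is to compare the centroids of hidden representations collected for the two attack types. The sketch below is a hypothetical illustration, not the paper's metric: it assumes you have already extracted one pooled hidden-state vector per audio prompt from some layer of the target ARM, and the function name `representation_drift` and its two statistics are illustrative choices.

```python
import numpy as np


def representation_drift(vanilla: np.ndarray, advanced: np.ndarray) -> dict:
    """Quantify drift between two sets of hidden representations.

    Each input has shape (n_samples, d): one pooled hidden-state vector per
    audio prompt (hypothetical preprocessing; the layer choice is an assumption).
    Returns the L2 distance and the cosine similarity between the two centroids.
    """
    mu_v = vanilla.mean(axis=0)
    mu_a = advanced.mean(axis=0)
    centroid_l2 = float(np.linalg.norm(mu_a - mu_v))
    centroid_cosine = float(mu_v @ mu_a / (np.linalg.norm(mu_v) * np.linalg.norm(mu_a)))
    return {"centroid_l2": centroid_l2, "centroid_cosine": centroid_cosine}


# Usage with synthetic data: a constant shift on half of the dimensions stands in
# for the drift that an advanced jailbreak might induce.
rng = np.random.default_rng(0)
vanilla = rng.normal(loc=1.0, size=(64, 16))
shift = np.concatenate([np.full(8, 3.0), np.zeros(8)])
advanced = vanilla + shift
print(representation_drift(vanilla, advanced))
```

Centroid distance only captures a mean shift; a fuller analysis might also compare per-sample distances or the spread of each cluster.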

Problem

Research questions and friction points this paper is trying to address.

Protecting audio reasoning models from advanced jailbreak attacks

Addressing representation drift between vanilla and advanced audio jailbreaks

Improving safety-accuracy trade-off in audio reasoning model training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-robust reasoning training for audio models

Robustness to worst-case representation drift induced by advanced jailbreaks

Maintains benign task performance while enhancing safety
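Training to be robust to worst-case drift can be read as a min-max problem: an inner step finds the bounded perturbation of a representation that most increases the safety loss, and an outer step updates the model on the perturbed representation. The toy NumPy sketch below illustrates that general recipe under strong assumptions. A logistic "refuse vs. comply" head stands in for the ARM, the drift is a per-sample additive perturbation with L2 norm at most `radius`, and the inner maximization uses projected gradient ascent. All names (`worst_case_drift`, `robust_train`, `radius`) are hypothetical; this is not the authors' Rebellion implementation.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def bce(h: np.ndarray, y: np.ndarray, w: np.ndarray, b: float) -> float:
    """Mean binary cross-entropy of a logistic head (toy stand-in for the ARM)."""
    p = np.clip(sigmoid(h @ w + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def worst_case_drift(h, y, w, b, radius, steps=10, step_size=None):
    """Inner max: per-sample delta with ||delta_i||_2 <= radius that increases the loss.

    Uses projected gradient ascent with normalized steps.
    """
    step_size = radius / 4 if step_size is None else step_size
    delta = np.zeros_like(h)
    for _ in range(steps):
        p = sigmoid((h + delta) @ w + b)
        grad = (p - y)[:, None] * w[None, :]  # d(loss_i)/d(h_i) for a logistic head
        grad_norm = np.linalg.norm(grad, axis=1, keepdims=True) + 1e-12
        delta = delta + step_size * grad / grad_norm  # normalized ascent step
        delta_norm = np.linalg.norm(delta, axis=1, keepdims=True)
        delta = delta * np.minimum(1.0, radius / (delta_norm + 1e-12))  # project onto the L2 ball
    return delta


def robust_train(h, y, radius, epochs=200, lr=0.5, seed=0):
    """Outer min: gradient descent on the loss at the worst-case drifted representations."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=h.shape[1])
    b = 0.0
    for _ in range(epochs):
        h_adv = h + worst_case_drift(h, y, w, b, radius)
        err = sigmoid(h_adv @ w + b) - y
        w = w - lr * (h_adv.T @ err) / len(y)
        b = b - lr * float(err.mean())
    return w, b


# Usage on synthetic "representations"; label 1 = refuse (hypothetical).
rng = np.random.default_rng(1)
h = rng.normal(size=(40, 5))
y = (h[:, 0] > 0).astype(float)
w, b = robust_train(h, y, radius=0.1)
print("clean loss:", bce(h, y, w, b))
print("worst-case loss:", bce(h + worst_case_drift(h, y, w, b, 0.1), y, w, b))
```

Unlike random noise of the same norm, the inner loop pushes every representation toward the decision boundary, so the outer update is trained against the most damaging drift within the budget. In a real ARM one would presumably perturb an intermediate hidden state and use the language-modeling loss on safe reasoning targets instead of a logistic loss.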
