Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

📅 2025-11-12
🤖 AI Summary
Existing Audio Reasoning Models (ARMs) trained via standard Reasoning Training (RT) exhibit high vulnerability to advanced audio jailbreaking attacks, frequently generating harmful outputs. This work is the first to systematically expose this critical security vulnerability and proposes Rebellion, a robust reasoning training framework. Rebellion jointly constructs safety-oriented adversarial audio examples and optimizes representation-space stability by explicitly modeling worst-case adversarial representation drift. Crucially, it achieves enhanced attack resilience without compromising performance on benign tasks. Extensive experiments on Qwen2-Audio demonstrate that Rebellion significantly outperforms RT: it markedly improves defense success rates against diverse state-of-the-art audio jailbreaking attacks while preserving over 98% of original task accuracy. Thus, Rebellion establishes a superior safety-accuracy trade-off for audio reasoning models.

📝 Abstract
Instilling reasoning capabilities in large models (LMs) using reasoning training (RT) significantly improves LMs' performance. Thus, Audio Reasoning Models (ARMs), i.e., audio LMs that can reason, are becoming increasingly popular. However, no work has studied the safety of ARMs against jailbreak attacks that aim to elicit harmful responses from target models. To this end, first, we show that standard RT with appropriate safety reasoning data can protect ARMs from vanilla audio jailbreaks, but cannot protect them against our proposed simple yet effective jailbreaks. We show that this is because of the significant representation drift between vanilla and advanced jailbreaks, which forces the target ARMs to emit harmful responses. Based on this observation, we propose Rebellion, a robust RT that trains ARMs to be robust to the worst-case representation drift. All our results are on Qwen2-Audio; they demonstrate that Rebellion: 1) can protect against advanced audio jailbreaks without compromising performance on benign tasks, and 2) significantly improves the accuracy-safety trade-off over the standard RT method.

Problem

Research questions and friction points this paper is trying to address.

Protecting audio reasoning models from advanced jailbreak attacks
Addressing representation drift between vanilla and advanced audio jailbreaks
Improving safety-accuracy trade-off in audio reasoning model training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-robust reasoning training for audio models
Protection against worst-case representation drift attacks
Maintains benign task performance while enhancing safety
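
The idea of training against worst-case representation drift can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes a linear safety head `w` on fixed representations `h`, labels in {-1, +1}, and an L2 drift budget `eps`, all of which are hypothetical choices made for illustration. Under those assumptions the loss-maximizing drift has a closed form, so the robust loss can be computed directly.

```python
import numpy as np


def logistic_loss(h, y, w):
    """Mean logistic loss of a linear head w on representations h, labels y in {-1, +1}."""
    margins = y * (h @ w)
    # log(1 + exp(-margin)), computed stably
    return float(np.mean(np.logaddexp(0.0, -margins)))


def worst_case_drift(h, y, w, eps):
    """Per-example drift with L2 norm eps that maximizes the logistic loss.

    For a linear head the loss decreases with the margin y * (h @ w), so the
    worst drift pushes each example against its label along w / ||w||.
    """
    direction = w / np.linalg.norm(w)
    return -eps * y[:, None] * direction[None, :]


def robust_loss(h, y, w, eps):
    """Loss evaluated at the worst-case drifted representations (toy assumption)."""
    return logistic_loss(h + worst_case_drift(h, y, w, eps), y, w)


# Synthetic data: 8 representations of dimension 4, labeled by the head itself.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))
w = rng.normal(size=4)
y = np.where(h @ w >= 0, 1.0, -1.0)

clean = logistic_loss(h, y, w)
robust = robust_loss(h, y, w, eps=0.5)
print(f"clean loss:  {clean:.4f}")
print(f"robust loss: {robust:.4f}")
```

In this toy setting the robust loss is always at least the clean loss, and minimizing it instead of the clean loss is one simple way to make a head less sensitive to bounded shifts in its inputs. A real audio model would need an approximate inner maximization (for example, gradient-based), since no closed form exists for a deep network.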