AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles three obstacles in multimodal audio generation: the lack of a unified modeling framework, the scarcity of large-scale, high-quality training data, and the high inference cost of multi-step diffusion sampling. A teacher model fuses text, video, and audio conditions through a Multimodal Adaptive Fusion module. It is then distilled into a few-step student using Distribution Matching Distillation adapted to flow matching, together with a diffusion-based discriminator, which yields high-fidelity audio in just four sampling steps. Trained on the newly curated IF-caps-Pro dataset of approximately 9.2 million samples, the model reports superior performance, especially on text-to-audio and text-to-music generation. It requires approximately 25× fewer function evaluations (NFE) than multi-step baselines while retaining strong instruction-following capability.
📝 Abstract
Audio and music generation based on flexible multimodal control signals is a widely applicable topic that faces three key challenges: 1) the lack of a unified multimodal modeling framework, 2) the scarcity of large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis. It is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks and find that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method supports audio generation under flexible multimodal control, with efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.
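To make the step-count claim concrete, here is a minimal sketch of few-step sampling from a flow-matching model, with a counter for the number of function evaluations (NFE). It is not the authors' implementation. The toy velocity field, the Euler integrator, the noise-at-t=0 convention, and the 100-evaluation baseline are all illustrative assumptions; the abstract states only "4 sampling steps" and "approximately 25x fewer" NFE. The names `make_toy_velocity` and `euler_sample` are hypothetical.

```python
def make_toy_velocity(target):
    """Return a toy velocity field and a counter of how often it is called.

    Assumption: a straight-line path x_t = (1 - t) * noise + t * target
    toward a single target point, whose exact velocity is
    (target - x) / (1 - t). A real model would be a trained network
    conditioned on text, video, or audio.
    """
    calls = {"nfe": 0}

    def velocity(x, t):
        calls["nfe"] += 1  # one function evaluation per call
        return [(tgt - xi) / (1.0 - t) for xi, tgt in zip(x, target)]

    return velocity, calls


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [1.0, 2.0]   # stand-in for a clean audio latent
noise = [0.5, -1.0]   # stand-in for the initial noise sample

# Few-step student: 4 Euler steps, one evaluation per step (assumed).
v_student, student_calls = make_toy_velocity(target)
few_step = euler_sample(v_student, noise, num_steps=4)

# Hypothetical multi-step baseline with 100 evaluations (assumed budget).
v_baseline, baseline_calls = make_toy_velocity(target)
many_step = euler_sample(v_baseline, noise, num_steps=100)

nfe_ratio = baseline_calls["nfe"] / student_calls["nfe"]
print(student_calls["nfe"], baseline_calls["nfe"], nfe_ratio)
```

In this toy setting the straight-line path lets both samplers reach the target, so the only difference is the evaluation count. In practice a few-step student needs distillation, as the abstract describes, to match the quality of a multi-step teacher.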
Problem

Research questions and friction points this paper is trying to address.

anything-to-audio generation
multimodal control
diffusion sampling
audio synthesis
efficient inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal audio generation
flow matching distillation
diffusion transformer
few-step synthesis
large-scale audio dataset