Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

📅 2026-06-05
🤖 AI Summary
This work addresses long-form, controllable audio generation for complex scenes, which requires coordinated synthesis of speech, sound effects, and music together with temporal structure and post-production processing. It proposes Audio-Oscar, described as the first multi-agent collaborative framework for this task. Audio-Oscar employs specialized agents responsible for character modeling, speech synthesis, fine-grained timeline planning, model orchestration, non-speech audio generation, and post-processing, integrated within a feedback-driven optimization loop. To facilitate systematic evaluation, the authors also introduce ASG-Bench, described as the first audio scene generation benchmark with temporal annotations. Experimental results are reported to show that Audio-Oscar outperforms existing approaches in both content accuracy and temporal coherence.
📝 Abstract
In recent years, audio generation has made significant progress in tasks such as text-to-speech (TTS), text-to-audio (TTA) and text-to-music (TTM). However, generating long-form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post-production. In this work, we introduce Audio-Oscar, a multi-agent framework for generating audio from complex descriptions. Audio-Oscar coordinates a set of specialist agents, each responsible for a different aspect of the audio scene, including character modeling and voice design, speech generation, fine-grained timeline planning, model selection, non-speech generation, and audio post-production. Audio-Oscar further incorporates feedback-driven refinement. In addition, to address the lack of suitable benchmarks for evaluating audio generation from complex audio scene descriptions, we construct ASG-Bench, an Audio Scene Generation Benchmark containing both scene descriptions paired with reference audio and text-only scene descriptions. Each scene is annotated with target audio events and temporal statements to evaluate whether the generated audio faithfully realizes the required scene content and temporal structure. Experimental results show that Audio-Oscar can effectively generate audio that matches complex scene descriptions. Project samples are available at https://audiooscar.github.io/. Our code is available at https://github.com/ziye26/Audio-Oscar.
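As a rough illustration of what "temporal statements" and "feedback-driven refinement" could mean in practice, the toy sketch below checks a planned timeline against ordering constraints and shifts events until the constraints hold. It is not the authors' implementation: the names `Event`, `Before`, `violated`, and `refine` are hypothetical, the only statement type modeled is a simple "A finishes before B starts" rule, and the real system relies on specialist agents and generative models rather than arithmetic on timestamps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One planned audio event on the scene timeline (hypothetical structure)."""
    label: str    # e.g. "sfx:door_knock" or "speech:host"
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class Before:
    """Assumed temporal statement: `first` must end no later than `second` starts."""
    first: str
    second: str


def violated(timeline, statements):
    """Return the temporal statements that the timeline does not satisfy."""
    by_label = {e.label: e for e in timeline}
    return [
        s for s in statements
        if by_label[s.first].end > by_label[s.second].start
    ]


def refine(timeline, statements, max_rounds=5):
    """Toy feedback loop: check, repair one violation by shifting, repeat."""
    for _ in range(max_rounds):
        bad = violated(timeline, statements)
        if not bad:
            break
        s = bad[0]
        by_label = {e.label: e for e in timeline}
        # Delay the later event by exactly the amount of overlap.
        shift = by_label[s.first].end - by_label[s.second].start
        timeline = [
            Event(e.label, e.start + shift, e.end + shift)
            if e.label == s.second else e
            for e in timeline
        ]
    return timeline


# Example scene: the knock should finish before the host starts speaking.
plan = [Event("sfx:door_knock", 0.0, 1.5), Event("speech:host", 1.0, 4.0)]
rules = [Before("sfx:door_knock", "speech:host")]
fixed = refine(plan, rules)
```

In this example the initial plan overlaps the knock and the speech by half a second, so the loop delays the speech event until the ordering rule is satisfied. A benchmark-style check would apply the same kind of test to events detected in the generated audio rather than to a plan.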
Problem

Research questions and friction points this paper is trying to address.

audio generation
complex audio scenes
long-form audio
controllable audio
audio scene description
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent system
audio scene generation
feedback-driven refinement
temporal planning
ASG-Bench
Yifan Duan
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University; Shanghai Innovation Institute
Qixiang Xu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Hengtao Wu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Zhanxun Liu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University; Shanghai Innovation Institute; Shanghai AI Laboratory
Wenhao Guan
Xiamen University
speech
Junxi Liu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Ziyang Ma
Shanghai Jiao Tong University
Speech and Language Processing, Textless NLP, Self-supervised Learning, Multimedia
Kelu Xu
State Key Laboratory of Complex & Critical Software Environment, China
Xie Chen
Shanghai Jiao Tong University <- Microsoft <- Cambridge University
Machine Learning, Speech Recognition, Speech Synthesis, Speech & Audio Processing