🤖 AI Summary
Existing text-to-audio (T2A) methods struggle with temporally precise prompts (e.g., "an owl hoots from 2.4–5.2 seconds") because high-quality time-aligned training data is scarce and prior timing-conditioned approaches require model fine-tuning. This work introduces FreeAudio, the first training-free framework for timing-controlled long-form T2A generation. It uses a large language model to decompose complex text and timing prompts into non-overlapping temporal windows and to recaption each window with a refined natural-language description. Precise timing is then achieved through Decoupling and Aggregating Attention Control, while Contextual Latent Composition and Reference Guidance provide local smoothness and global consistency, respectively. Without any parameter updates or task-specific training, the method generates coherent audio spanning tens of seconds. Quantitative and qualitative evaluations show that it achieves state-of-the-art timing-conditioned synthesis quality among training-free approaches, is comparable to leading training-based methods, and matches the long-form generation quality of training-based Stable Audio.
📝 Abstract
Text-to-audio (T2A) generation has achieved promising results with recent advances in generative models. However, because of the limited quality and quantity of temporally aligned audio-text pairs, existing T2A methods struggle to handle complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, but their synthesis quality remains limited. In this work, we propose FreeAudio, a novel training-free timing-controlled T2A framework, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural-language description, based on the input text and timing prompts. We then introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates long-form generation quality comparable to that of training-based Stable Audio, paving the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/
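To make the planning step concrete, the sketch below shows how overlapping timed events (as in the "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s" example) can be split into non-overlapping windows via an interval-boundary sweep. This is only an illustrative stand-in: in FreeAudio the planning and the per-window recaptioning are both performed by an LLM, and the function name and joined descriptions here are hypothetical, not the paper's implementation.

```python
def plan_windows(events):
    """Split overlapping timed events into non-overlapping windows.

    events: list of (description, start_s, end_s) tuples.
    Returns a list of (start_s, end_s, [active descriptions]).
    Stand-in for FreeAudio's LLM planning step; the paper's LLM also
    rewrites each window into a refined natural-language caption.
    """
    # Every interval boundary is a potential window edge.
    edges = sorted({t for _, s, e in events for t in (s, e)})
    windows = []
    for a, b in zip(edges, edges[1:]):
        # Keep events that fully cover this window.
        active = [d for d, s, e in events if s <= a and b <= e]
        if active:
            windows.append((a, b, active))
    return windows

prompts = [("crickets chirping", 0.0, 24.0), ("owl hooted", 2.4, 5.2)]
for start, end, descs in plan_windows(prompts):
    print(f"{start:.1f}-{end:.1f}s: " + " and ".join(descs))
# → 0.0-2.4s: crickets chirping
#   2.4-5.2s: crickets chirping and owl hooted
#   5.2-24.0s: crickets chirping
```

Each window then receives its own refined description, so the downstream attention control only has to enforce one fixed set of events per window.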