Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text- and video-to-audio generation methods suffer from limited audio fidelity, imprecise cross-modal alignment, and poor controllability over duration and loudness; video-driven approaches often require additional alignment-specific training. This paper introduces the first LLM-augmented, diffusion-based audio agent framework for stepwise, instruction-guided generation, enabling high-fidelity, long-duration, multi-event audio synthesis and fine-grained editing from either text or video input. Key innovations include: (1) a timestamp-free semantic-temporal joint alignment strategy; and (2) an integrated architecture combining a TTA diffusion model, GPT-4 for instruction decomposition, a fine-tuned Gemma2-2B-it, cross-modal conditional encoding, and a multi-stage agent scheduling mechanism. Experiments demonstrate state-of-the-art performance on both text-to-audio (TTA) and video-to-audio (VTA) tasks, with low training overhead, zero-shot editability, and robust variable-length generation.

📝 Abstract
We introduce Audio-Agent, a multimodal framework for audio generation, editing, and composition based on text or video inputs. Conventional approaches to text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio given complex text conditions. In our method, a pre-trained TTA diffusion network serves as the audio generation agent and works in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. In doing so, Audio-Agent can generate high-quality audio that is closely aligned with the provided text or video, even when the condition involves multiple complex events, while supporting variable-length and variable-volume generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time-consuming. Instead, we propose a simpler approach: fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions that bridge the video and audio modalities. Consequently, our framework offers a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.
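The decompose-then-generate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `decompose_with_llm` stands in for the GPT-4 call that splits a complex prompt into atomic events, and `tta_generate` stands in for the pre-trained TTA diffusion model; the event fields (start time, duration, gain) illustrate the variable-length and variable-volume control the paper claims.

```python
# Hypothetical sketch of an Audio-Agent-style decompose-then-generate loop.
# All function and field names are illustrative, not the paper's API.
from dataclasses import dataclass

SR = 16000  # assumed sample rate (Hz)

@dataclass
class AudioEvent:
    description: str   # atomic instruction, e.g. "distant thunder"
    start: float       # onset in seconds
    duration: float    # length in seconds (variable-length generation)
    gain_db: float = 0.0  # per-event loudness (variable-volume generation)

def decompose_with_llm(prompt: str) -> list[AudioEvent]:
    """Stand-in for GPT-4 decomposing a complex prompt into atomic events.
    A real system would query an LLM; here we return a fixed plan."""
    return [
        AudioEvent("rain falling on a roof", start=0.0, duration=10.0),
        AudioEvent("distant thunder", start=4.0, duration=3.0, gain_db=-6.0),
    ]

def tta_generate(event: AudioEvent) -> list[float]:
    """Stand-in for the pre-trained TTA diffusion agent (mono PCM samples)."""
    return [0.0] * int(event.duration * SR)  # silence placeholder

def compose(prompt: str) -> list[float]:
    """Generate each atomic event, then mix them at their onsets."""
    events = decompose_with_llm(prompt)
    total = max(e.start + e.duration for e in events)
    mix = [0.0] * int(total * SR)
    for e in events:
        clip = tta_generate(e)
        scale = 10 ** (e.gain_db / 20)       # dB gain -> linear scale
        offset = int(e.start * SR)
        for i, sample in enumerate(clip):
            mix[offset + i] += sample * scale
    return mix

audio = compose("rain on a roof with distant thunder")
print(len(audio))  # 160000 samples = 10 s at 16 kHz
```

The key design point the sketch captures is that multi-event alignment is handled by the planning step (event onsets and durations chosen by the LLM), so the diffusion model itself only ever sees one atomic instruction at a time.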
Problem

Research questions and friction points this paper is trying to address.

Audio Synthesis
Content Alignment
Flexibility in Audio Adjustment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Agent
Pre-trained Large Language Model
Integrated Audio-Video Processing