UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional audio generation systems: most are built for a single task and cannot cover speech synthesis, general sound generation, and complex scene-level editing within one framework. The authors propose a unified latent diffusion architecture that injects intermediate-layer representations from a frozen multimodal large language model (MLLM) into corresponding MM-DiT blocks through a depth-aware fusion mechanism. Tasks are distinguished by channel masking, so the same set of weights supports text-to-audio generation, speech synthesis, zero-shot speaker cloning, mixed audio generation, and scene-level editing. With only 621M–732M trainable parameters, the model matches or exceeds the performance of specialized systems at roughly one-quarter the size of existing unified approaches.
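The channel-masking idea can be illustrated with a small sketch. The summary does not spell out how the mask is applied, so the version below makes a loud assumption: source-audio latents are concatenated to the noisy latents along the channel axis, and a per-task binary mask zeroes the source channels for pure generation tasks. The names `TASK_MASKS` and `build_model_input`, the task keys, and the channel counts are all hypothetical and not taken from the paper.

```python
# Hypothetical per-task masks: one 0/1 entry per source-latent channel.
# Assumption: generation tasks hide the source channels, editing tasks expose them.
TASK_MASKS = {
    "text_to_audio": [0, 0],
    "scene_editing": [1, 1],
}


def build_model_input(noisy, source, task):
    """Concatenate noisy latents with mask-gated source latents along channels.

    noisy, source: lists of channels, each channel a list of floats over time.
    Returns a new list of channels; the channel count is the same for every task.
    """
    mask = TASK_MASKS[task]
    if len(mask) != len(source):
        raise ValueError("mask needs one entry per source channel")
    # Multiply every value in a source channel by that channel's mask bit.
    gated = [[m * v for v in channel] for channel, m in zip(source, mask)]
    # Channel concatenation: noisy channels first, gated source channels after.
    return noisy + gated


noisy = [[0.5, -0.5, 0.25]]                   # 1 noisy channel, 3 time steps
source = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # 2 source-latent channels

gen_input = build_model_input(noisy, source, "text_to_audio")
edit_input = build_model_input(noisy, source, "scene_editing")
```

Under this assumption both tasks produce an input with the same number of channels, which is what would let one set of weights serve generation and editing alike.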

📝 Abstract

We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. One set of weights handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition. Our architecture features two core designs: (1) layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture in which task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M–732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across the evaluated domains, while being roughly 4× smaller than comparable unified systems.
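To make the depth matching in design (1) concrete, the sketch below maps each MM-DiT block to one MLLM layer. The abstract says the layers are uniformly sampled but gives no exact rule, so this assumes evenly spaced indices running from the first to the last layer. The function name `sample_layer_indices` and the layer and block counts in the example are hypothetical.

```python
def sample_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> list[int]:
    """Pick one LLM layer index per DiT block, evenly spaced over the LLM depth.

    Assumption (not from the paper): block 0 reads the first LLM layer and the
    final block reads the last LLM layer, with the rest spread evenly between.
    """
    if num_dit_blocks < 1:
        raise ValueError("need at least one DiT block")
    if num_llm_layers < num_dit_blocks:
        raise ValueError("need at least as many LLM layers as DiT blocks")
    if num_dit_blocks == 1:
        # With a single block, fall back to the last layer only.
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_dit_blocks - 1)
    return [round(i * step) for i in range(num_dit_blocks)]


# Example with made-up sizes: a 28-layer LLM feeding 4 DiT blocks.
layer_map = sample_layer_indices(28, 4)
```

In a full model, the hidden state at `layer_map[i]` would pass through a learned projection before entering block `i`; that projection is omitted here.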

Problem

Research questions and friction points this paper is trying to address.

unified audio generation

speech synthesis

sound generation

audio editing

multimodal modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

deep LLM fusion

unified audio generation

multi-task architecture