SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the deployment challenges of LLM-based TTS models—namely their large parameter count and high inference latency—this paper proposes an efficient compression framework. It employs a word error rate (WER)-guided layer importance metric to drive structured pruning, coupled with multi-level adaptive knowledge distillation to restore autoregressive coherence. Using less than 5% of the original training data, the method halves the number of Transformer layers, reduces VRAM consumption by up to 20%, and achieves up to a 1.7× faster real-time factor. In zero-shot speech synthesis, the compressed model retains naturalness and speaker similarity comparable to the full model. The core contribution lies in integrating task-aware importance estimation with hierarchical knowledge transfer, achieving high-fidelity deep compression of LLM-TTS with minimal computational overhead.
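The WER-guided pruning step can be pictured as a layer-ablation loop: estimate each Transformer layer's importance by how much WER degrades when that layer is skipped, then drop the least important half. The sketch below is a toy illustration of that idea under assumed names; `evaluate_wer` and the per-layer "contributions" are stand-ins, not the paper's actual metric or model.

```python
# Hypothetical sketch of WER-guided layer importance for structured pruning.
# The real SPADE metric and model internals are not public here; the toy
# "model" below assigns each layer a fixed WER contribution for illustration.

def evaluate_wer(active_layers):
    """Stand-in WER evaluator: each kept layer reduces WER by a fixed amount."""
    contribution = {0: 0.01, 1: 0.08, 2: 0.02, 3: 0.12, 4: 0.03, 5: 0.10}
    base_wer = 0.50
    return base_wer - sum(contribution[i] for i in active_layers)

def rank_layers_by_importance(num_layers):
    """Importance of layer i = WER increase observed when layer i is ablated."""
    full = list(range(num_layers))
    baseline = evaluate_wer(full)
    importance = {}
    for i in full:
        ablated = [j for j in full if j != i]
        importance[i] = evaluate_wer(ablated) - baseline  # WER degradation
    return sorted(importance, key=importance.get)  # least important first

def prune_half(num_layers):
    """Keep the half of layers whose removal hurts WER the most."""
    order = rank_layers_by_importance(num_layers)
    keep = sorted(order[num_layers // 2:])
    return keep

print(prune_half(6))  # keeps layers [1, 3, 5] under the toy contributions
```

In the paper, pruning alone is not enough: the distillation stage then retrains the shortened stack so the autoregressive generation stays coherent.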

📝 Abstract
The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
Problem

Research questions and friction points this paper is trying to address.

Reducing large parameter counts and high latency in LLM-TTS systems
Maintaining perceptual quality while significantly improving efficiency
Enabling practical real-time speech generation with compact models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured pruning guided by word-error-rate importance
Multi-level knowledge distillation for autoregressive coherence
Halving Transformer depth while preserving perceptual quality
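The multi-level distillation idea—matching the teacher at both the output and intermediate layers—can be sketched as a combined loss. Everything below (loss weights, the layer mapping from a 12-layer teacher to a 6-layer student, tensor shapes) is an illustrative assumption, not the paper's actual configuration.

```python
# Hypothetical sketch of a multi-level distillation objective: the pruned
# student matches the teacher at the output-logit level (KL divergence) and
# at selected hidden layers (MSE). Weights and layer mapping are assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-9):
    """KL(p || q) averaged over sequence positions."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def multi_level_distill_loss(teacher_logits, student_logits,
                             teacher_hidden, student_hidden,
                             layer_map, alpha=1.0, beta=0.1):
    """alpha * logit-level KL + beta * mean hidden-state MSE over mapped layers."""
    logit_loss = kl_div(softmax(teacher_logits), softmax(student_logits))
    hidden_loss = np.mean([np.mean((teacher_hidden[t] - student_hidden[s]) ** 2)
                           for t, s in layer_map])
    return alpha * logit_loss + beta * hidden_loss

# Toy usage: a 12-layer teacher distilled into a 6-layer student, where the
# student happens to reproduce the teacher exactly, so the loss is zero.
rng = np.random.default_rng(0)
T, V, H = 4, 8, 16  # positions, vocab size, hidden dim
t_logits = rng.normal(size=(T, V)); s_logits = t_logits.copy()
t_hidden = {i: rng.normal(size=(T, H)) for i in range(12)}
s_hidden = {i: t_hidden[2 * i + 1] for i in range(6)}  # mirrors every other layer
layer_map = [(2 * i + 1, i) for i in range(6)]
loss = multi_level_distill_loss(t_logits, s_logits, t_hidden, s_hidden, layer_map)
print(round(loss, 6))  # 0.0 when student exactly matches teacher
```

A loss of this shape lets the shortened student inherit both the teacher's token distribution and its intermediate representations, which is what the paper credits for restoring autoregressive coherence after layers are removed.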
Tan Dat Nguyen
Student, KAIST
Machine Learning, Speech Synthesis, Audio Generation, Generative AI
Jaehun Kim
Korea Advanced Institute of Science and Technology, South Korea
Ji-Hoon Kim
Korea Advanced Institute of Science and Technology, South Korea
Shukjae Choi
42dot Inc., South Korea
Youshin Lim
42dot Inc., South Korea
Joon Son Chung
KAIST
Machine Learning, Speech Processing, Computer Vision