GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a problem in zero-shot text-to-speech (TTS) synthesis: a speaker prompt entangles speaker identity with prosodic attributes such as speaking rate and pitch, which makes it hard to change style independently of the prompt. The authors propose learning each control from post-generation rewards rather than style labels. With the TTS backbone frozen, they train one lightweight LoRA adapter per control axis using Group Relative Policy Optimization (GRPO), with speech-token length (for speaking rate) and mean fundamental frequency (F0, for pitch) as style rewards and word error rate (WER) as an intelligibility constraint. The work is presented as the first to combine GRPO with LoRA for zero-shot TTS. Because each control is a LoRA weight update, independently trained adapters can be interpolated and composed across axes through linear arithmetic, without fine-tuning the backbone. Experiments on speaking rate and pitch show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, along with smooth interpolation and multi-axis composition across adapters.
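The reward setup summarized above can be illustrated with a minimal sketch of the group-relative advantage that gives GRPO its name: several outputs are sampled for the same prompt, each is scored, and the scores are standardized within the group. The summary does not say how the style reward and the WER term are combined, so the weighted difference in `combined_reward`, the `wer_weight` parameter, and the sample values are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def combined_reward(style_score, wer, wer_weight=1.0):
    """Hypothetical reward: a style score minus a weighted WER penalty.

    The weighted-difference form is an assumption, not the paper's formula.
    """
    return style_score - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt: (style_score, wer).
samples = [(0.9, 0.05), (0.4, 0.02), (0.7, 0.30), (0.2, 0.01)]
rewards = [combined_reward(s, w) for s, w in samples]
advantages = group_relative_advantages(rewards)

for (style, wer), adv in zip(samples, advantages):
    print(f"style={style:.2f} wer={wer:.2f} advantage={adv:+.3f}")
```

In this sketch the third sample has a high style score but also a high WER, so its advantage falls below the group mean. That is the role the summary assigns to WER: keeping the style reward from being pursued at the cost of intelligibility.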

📝 Abstract

We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
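The linear LoRA arithmetic mentioned in the abstract can be sketched as follows. A LoRA adapter stores a low-rank update `B @ A` for a frozen weight matrix, so scaling one update interpolates a single control and summing several updates composes controls. This is a toy illustration in plain Python, not the GLASS implementation: the helper names (`matmul`, `lora_delta`, `compose`), the 2x2 matrices, and the strength values are all hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_delta(B, A, scale=1.0):
    """Low-rank weight update: scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(B, A)]


def compose(W0, weighted_deltas):
    """Return W0 + sum(strength * delta); the frozen W0 is left unmodified."""
    W = [row[:] for row in W0]
    for delta, strength in weighted_deltas:
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                W[i][j] += strength * v
    return W


# Frozen backbone weight (toy 2x2) and two independently trained rank-1 adapters.
W0 = [[1.0, 0.0], [0.0, 1.0]]
rate_delta = lora_delta(B=[[1.0], [0.0]], A=[[0.5, 0.0]])   # hypothetical "rate" axis
pitch_delta = lora_delta(B=[[0.0], [1.0]], A=[[0.0, 0.2]])  # hypothetical "pitch" axis

# Interpolation: apply the rate adapter at half strength.
W_half_rate = compose(W0, [(rate_delta, 0.5)])

# Multi-axis composition: half-strength rate plus full-strength pitch.
W_both = compose(W0, [(rate_delta, 0.5), (pitch_delta, 1.0)])
```

Setting a strength to 0 recovers the backbone weights, and swapping adapters amounts to changing which deltas are passed to `compose`. Whether such sums behave well perceptually is an empirical question, which the abstract reports testing through its interpolation and multi-axis composition experiments.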

Problem

Research questions and friction points this paper is trying to address.

zero-shot text-to-speech

acoustic style control

speaker prompt entanglement

prosodic attributes

style steering

Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA

zero-shot TTS

acoustic style control