AI Summary
Deploying high-quality text-to-speech (TTS) models on resource-constrained edge devices demands simultaneous optimization of synthesis fidelity and model footprint. This paper proposes an ultra-lightweight TTS compression framework featuring two key innovations: (1) the first 1.58-bit quantization-aware training (QAT) for edge TTS, which constrains weights to ternary values {-1, 0, 1}; and (2) Weight Indexing, a novel technique that maps each group of low-bit weights to a single int8 index, drastically reducing both memory storage and computational overhead. Evaluated on standard benchmarks, the method achieves an 83% model-size reduction while preserving speech naturalness and outperforming a non-quantized baseline of comparable size. The core contributions are: (i) the first edge-deployable TTS system supporting 1.58-bit QAT; (ii) an efficient int8-indexed weight representation scheme; and (iii) real-time, high-fidelity speech synthesis on severely resource-limited hardware.
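The 1.58-bit QAT step can be illustrated with a minimal sketch. The quantizer below uses the absmean ternary rule popularized by BitNet b1.58 (per-tensor scale gamma = mean |w|, then round and clip to {-1, 0, 1}); the paper's exact quantizer and scaling rule are not specified here, so treat this as an assumption, not the authors' implementation.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a float weight tensor to ternary values {-1, 0, 1}.

    Absmean-style quantizer (assumed, in the style of BitNet b1.58):
    w is approximated as gamma * w_q, where gamma is a per-tensor scale.
    Returns the ternary int8 tensor and the scale gamma.
    """
    gamma = np.abs(w).mean() + eps             # per-tensor scale
    w_q = np.clip(np.round(w / gamma), -1, 1)  # ternary {-1, 0, 1}
    return w_q.astype(np.int8), gamma

# During QAT the forward pass would use gamma * w_q in place of w, while
# the backward pass passes gradients straight through to the full-precision
# w (straight-through estimator), so the model learns to tolerate ternary weights.
```

For example, `ternary_quantize(np.array([0.4, -0.05, 1.2, -0.9]))` yields the ternary weights `[1, 0, 1, -1]` with `gamma ≈ 0.64`.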
Abstract
This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58 bits. In this case, most of the 32-bit model parameters are quantized to ternary values {-1, 0, 1}. Second, we propose a method named weight indexing, in which a group of 1.58-bit weights is stored as a single int8 index. This allows for efficient storage of model parameters, even on hardware that handles values in 8-bit units. Experimental results demonstrate that the proposed method achieves an 83% reduction in model size while outperforming, in synthesis quality, a non-quantized baseline of similar model size.
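A natural instantiation of weight indexing is base-3 packing: since 3^5 = 243 ≤ 256, five ternary weights fit in one int8 index, i.e. 8/5 = 1.6 bits per weight, matching the 1.58-bit (log2 3) claim up to rounding. The group size of 5 and the base-3 encoding are assumptions for illustration; the abstract does not specify the exact indexing scheme.

```python
import numpy as np

GROUP = 5  # assumed group size: 3**5 = 243 <= 256, so 5 ternary weights per int8

def pack_ternary(w_q):
    """Map each group of GROUP ternary weights {-1, 0, 1} to one uint8 index."""
    digits = (np.asarray(w_q, dtype=np.int64) + 1).reshape(-1, GROUP)  # shift to {0, 1, 2}
    powers = 3 ** np.arange(GROUP)
    return (digits @ powers).astype(np.uint8)  # base-3 value of each group

def unpack_ternary(idx):
    """Recover the ternary weights from the uint8 indices (in practice via a LUT)."""
    idx = np.asarray(idx, dtype=np.int64)[:, None]
    digits = (idx // 3 ** np.arange(GROUP)) % 3  # base-3 digits of each index
    return (digits - 1).reshape(-1).astype(np.int8)
```

At inference time, unpacking would typically be a 256-entry lookup table from each int8 index to its five ternary weights, which keeps both storage and decode cost low on byte-addressed hardware.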