AI Summary
This work addresses codebook collapse in large-codebook vector quantization, where codewords lose their assignments and quantization error rises, and identifies encoder drift as a key driver of the failure. The authors propose Non-stationary-aware Vector Quantization (NSVQ), a training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. The embedding loss helps the codebook track encoder drift early in training; the encoder is then frozen so that the codebook can consolidate under a fixed latent geometry; adversarial refinement is reintroduced last. On ImageNet-1k at 128×128 resolution with 65,536 codes, NSVQ reduces the reconstruction Fréchet Inception Distance (rFID) from 2.39 to 2.10 relative to SimVQ while both methods maintain 100% codebook utilization, and it also improves the generation FID of downstream latent diffusion models.
Abstract
Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
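The failure mode described above can be illustrated with a small NumPy sketch. It is not the NSVQ procedure: the abstract does not specify the non-stationary embedding loss or the replacement rule, so the code below uses a generic dead-code heuristic (moving unassigned codes onto randomly chosen latents) purely as an illustration. The function names (`assign`, `stats`, `replace_dead_codes`), the toy sizes, and the constant shift standing in for encoder drift are all assumptions made for this example.

```python
import numpy as np

def assign(latents, codebook):
    """Index of the nearest codeword (squared Euclidean distance) per latent."""
    # Pairwise squared distances, shape (num_latents, num_codes).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def stats(latents, codebook):
    """Return (codebook utilization, mean squared quantization error)."""
    idx = assign(latents, codebook)
    utilization = np.unique(idx).size / codebook.shape[0]
    error = float(((latents - codebook[idx]) ** 2).sum(axis=-1).mean())
    return utilization, error

def replace_dead_codes(latents, codebook, rng):
    """Generic heuristic (assumed, not from the paper): move codes that
    received no assignments onto randomly chosen latents."""
    counts = np.bincount(assign(latents, codebook), minlength=codebook.shape[0])
    dead = np.flatnonzero(counts == 0)
    updated = codebook.copy()
    if dead.size:
        picks = rng.choice(latents.shape[0], size=dead.size, replace=False)
        updated[dead] = latents[picks]
    return updated, int(dead.size)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))     # toy codebook: 64 codes, 8 dimensions
latents = rng.normal(size=(2048, 8))    # toy encoder outputs
drifted = latents + 6.0                 # hypothetical drift: a constant shift

util_start, err_start = stats(latents, codebook)
util_drift, err_drift = stats(drifted, codebook)   # codebook has not moved

new_codebook, num_replaced = replace_dead_codes(drifted, codebook, rng)
util_fixed, err_fixed = stats(drifted, new_codebook)

print(f"before drift:      utilization={util_start:.2f}, error={err_start:.2f}")
print(f"after drift:       utilization={util_drift:.2f}, error={err_drift:.2f}")
print(f"after replacement: utilization={util_fixed:.2f}, error={err_fixed:.2f} "
      f"({num_replaced} codes replaced)")
```

In this toy setting the stale codebook loses most of its assignments once the latents shift, and relocating the unassigned codes restores much of the utilization. A real VQ model would also have to handle the gradient path through the straight-through estimator, which this sketch leaves out.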