🤖 AI Summary
To address the challenges of large parameter counts, slow inference, and poor deployability in learned image compression (LIC) models, this paper proposes a lightweighting method based on a Swin-V2 teacher model and Feature-Entropy Dual-driven Distillation (FEDS). The approach introduces an entropy-weighted channel-level distillation mechanism that jointly aligns attention-aware features and models latent-space channel importance guided by information entropy. A three-stage progressive knowledge transfer framework further enhances distillation efficacy. The resulting student model incurs only marginal BD-Rate degradations of +1.24%, +1.17%, and +0.55% on the Kodak, Tecnick, and CLIC benchmarks, respectively, while reducing parameters by 63% and accelerating encoding/decoding by 73%. Notably, the method generalizes well across diverse Transformer-based LIC architectures.
📝 Abstract
Learned image compression (LIC) methods have recently outperformed traditional codecs such as VVC in rate-distortion performance. However, their large models and high computational costs have limited practical adoption. In this paper, we first construct a high-capacity teacher model by integrating Swin-Transformer V2-based attention modules, additional residual blocks, and expanded latent channels, achieving enhanced compression performance. Building on this foundation, we propose a **F**eature and **E**ntropy-based **D**istillation **S**trategy (**FEDS**) that transfers key knowledge from the teacher to a lightweight student model. Specifically, we align intermediate feature representations and emphasize the most informative latent channels through an entropy-based loss. A staged training scheme refines this transfer in three phases: feature alignment, channel-level distillation, and final fine-tuning. Our student model nearly matches the teacher across Kodak (1.24% BD-Rate increase), Tecnick (1.17%), and CLIC (0.55%) while cutting parameters by about 63% and accelerating encoding/decoding by around 73%. Moreover, ablation studies indicate that FEDS generalizes effectively to Transformer-based networks. The experimental results demonstrate that our approach strikes a compelling balance among compression performance, speed, and model size, making it well-suited for real-time or resource-limited scenarios.
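The entropy-based channel weighting can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the function name, the use of per-channel estimated bit costs as importance weights, and the plain weighted MSE are all assumptions made for clarity.

```python
import numpy as np

def entropy_weighted_channel_distill(y_teacher, y_student, bits_per_channel):
    """Sketch of an entropy-weighted channel-level distillation loss.

    y_teacher, y_student: latent tensors of shape (C, H, W).
    bits_per_channel: estimated coding cost per channel, shape (C,),
        e.g. the summed -log2 p(y_c) from the teacher's entropy model
        (hypothetical choice of importance signal).

    Channels that consume more bits carry more information, so they
    receive larger weights and the student is pushed to match them first.
    """
    # Normalize bit costs into channel weights that sum to 1.
    w = bits_per_channel / (bits_per_channel.sum() + 1e-8)
    # Per-channel mean squared error between teacher and student latents.
    per_channel_mse = ((y_teacher - y_student) ** 2).mean(axis=(1, 2))  # (C,)
    # Weighted sum: informative channels dominate the distillation loss.
    return float((w * per_channel_mse).sum())
```

In a staged scheme like the one described above, such a term would be active mainly during the channel-level distillation phase, alongside a feature-alignment loss on intermediate representations.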