🤖 AI Summary
This work addresses the high deployment cost of large language models for less-represented European languages, such as Polish, by applying NVIDIA's Minitron compression paradigm to the Bielik-11B-v3.0 model for the first time. The authors propose a two-stage compression approach that combines structured hybrid pruning with logit-based knowledge distillation, complemented by a multi-stage alignment strategy of supervised fine-tuning (SFT), DPO-P, and GRPO. The resulting Bielik-Minitron-7B model reduces the parameter count by 33.4% (to 7.35B) while retaining approximately 90% of the original model's performance and achieving up to 50% faster inference, substantially lowering deployment costs.
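To make the width-pruning half of this pipeline concrete, here is a minimal toy sketch of structured pruning on a two-layer MLP. It is not the report's implementation: the unit names and sizes are illustrative, and the importance score is a simple weight-magnitude proxy, whereas Minitron-style pruning ranks units by activation-based importance measured on calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: hidden width 8, structurally pruned to 6 units.
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))   # input -> hidden
W2 = rng.normal(size=(d_out, d_hidden))  # hidden -> output

# Score each hidden unit by the product of the L2 norms of its
# incoming and outgoing weights (a magnitude-based stand-in for
# activation-based importance), then keep the top-k units.
keep = 6
scores = np.linalg.norm(W1, axis=1) * np.linalg.norm(W2, axis=0)
kept = np.sort(np.argsort(scores)[-keep:])

# Structured pruning: remove whole rows/columns, shrinking the
# hidden dimension rather than zeroing individual weights.
W1_pruned = W1[kept, :]
W2_pruned = W2[:, kept]

x = rng.normal(size=d_in)
y_pruned = W2_pruned @ np.maximum(W1_pruned @ x, 0.0)
```

Because entire units are removed, the pruned matrices are genuinely smaller and faster to multiply, which is why structured pruning (unlike unstructured sparsity) translates directly into inference speedups. The pruned model then needs distillation to recover the lost quality.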
📝 Abstract
This report details the creation of Bielik-Minitron-7B, a compressed 7.35B-parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. Applying a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We used the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation to recover quality. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and reinforcement learning with Group Relative Policy Optimization (GRPO). Our final model recovered approximately 90% of the baseline model's performance while delivering up to 50% faster inference. This approach demonstrates an efficient pathway to creating language models for less-represented languages, largely preserving the original model's quality while reducing inference deployment costs.
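The logit-based distillation step can be illustrated with a minimal sketch of the standard temperature-scaled KL objective: the pruned student is trained to match the teacher's softened output distribution over the vocabulary. This is a generic illustration of logit distillation, not the NeMo Framework's actual training code; the logit values and temperature below are arbitrary.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing the teacher's relative preferences among tokens.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    # Forward KL(teacher || student) on softened logits, scaled by
    # T^2 so gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl

teacher = [2.0, 0.5, -1.0]   # frozen 11B teacher's logits (toy values)
matched = [2.0, 0.5, -1.0]   # student agrees with the teacher
drifted = [0.0, 2.0, -1.0]   # student disagrees, so the loss is larger
```

Minimizing this loss over the training corpus pulls the pruned 7.35B student back toward the 11B teacher's behavior, which is how the reported ~90% quality recovery is achieved before the alignment stages.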