AI Summary
Existing structured pruning methods for AI accelerators, particularly 2:4 sparsity, suffer from accuracy degradation due to the two-stage prune-then-fine-tune paradigm and the lack of hardware-aware end-to-end sparse training.
Method: This paper proposes a sparse training framework that integrates local feature correlation into mask optimization. Its core innovation is the first analytically solvable proximal operator for 2:4 sparsity, enabling exact projection onto the 2:4 constraint set. The method embeds localized squared loss minimization into mask optimization and couples it with a mask-gradient update mechanism, ensuring hardware compatibility while improving model fidelity.
Contribution/Results: By eliminating the performance gap inherent in post-hoc pruning, the approach achieves end-to-end sparse training. Experiments show it significantly outperforms state-of-the-art pruning baselines on a 13B LLM and matches dense baseline accuracy on a 70B model, demonstrating both effectiveness and scalability for ultra-large language models.
Abstract
Recent hardware advancements in AI accelerators and GPUs make it possible to compute sparse matrix multiplications efficiently, especially when 2 out of 4 consecutive weights are set to zero. However, this so-called 2:4 sparsity usually comes at the cost of decreased model accuracy. We derive a regularizer that exploits the local correlation of features to find better sparsity masks in trained models. We minimize the regularizer jointly with a local squared loss by deriving the proximal operator, which we show has an efficient solution in the 2:4-sparse case. After optimizing the mask, we use masked-gradient updates to further minimize the local squared loss. We illustrate our method on toy problems and apply it to pruning entire large language models with up to 70B parameters. On models up to 13B parameters we improve over previous state-of-the-art algorithms, whilst on 70B models we match their performance.
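The paper's proximal operator additionally accounts for the correlation-based regularizer; the plain Euclidean projection onto the 2:4 constraint set, by contrast, is simple magnitude-based selection: in every group of 4 consecutive weights, keep the 2 largest in magnitude and zero the rest. A minimal NumPy sketch of that baseline projection (the function name is hypothetical, not from the paper):

```python
import numpy as np

def project_2_4(weights):
    """Euclidean projection onto the 2:4 sparsity constraint set:
    in each group of 4 consecutive weights, keep the 2 entries with
    the largest magnitude and zero out the other 2."""
    w = np.asarray(weights, dtype=float).reshape(-1, 4)  # groups of 4
    out = np.zeros_like(w)
    # column indices of the 2 largest-magnitude entries per group
    top2 = np.argsort(np.abs(w), axis=1)[:, -2:]
    rows = np.arange(w.shape[0])[:, None]
    out[rows, top2] = w[rows, top2]
    return out.reshape(np.shape(weights))

w = np.array([0.1, -2.0, 0.5, 3.0, 1.0, 0.2, -0.3, 0.9])
print(project_2_4(w))  # → [ 0.  -2.   0.   3.   1.   0.   0.   0.9]
```

This magnitude-only rule ignores feature correlations; the paper's contribution is precisely to fold a local-correlation regularizer into this projection step while keeping it analytically solvable.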