AI Summary
Existing structured pruning methods for AI accelerators, particularly 2:4 sparsity, suffer from accuracy degradation due to the two-stage prune-then-fine-tune paradigm and the lack of hardware-aware end-to-end sparse training.
Method: This paper proposes a sparse training framework that integrates local feature correlation into mask optimization. Its core innovation is the first analytically solvable proximal operator for 2:4 sparsity, enabling exact projection onto the 2:4 constraint set. The method embeds localized squared loss minimization into mask optimization and couples it with a mask-gradient update mechanism, ensuring hardware compatibility while improving model fidelity.
Contribution/Results: By eliminating the performance gap inherent in post-hoc pruning, the approach achieves end-to-end sparse training. Experiments show it significantly outperforms state-of-the-art pruning baselines on a 13B LLM and matches dense baseline accuracy on a 70B model, demonstrating both effectiveness and scalability for ultra-large language models.
Abstract
Recent hardware advancements in AI accelerators and GPUs make it possible to compute sparse matrix multiplications efficiently, especially when 2 out of 4 consecutive weights are set to zero. However, this so-called 2:4 sparsity usually comes at the cost of decreased model accuracy. We derive a regularizer that exploits the local correlation of features to find better sparsity masks in trained models. We minimize the regularizer jointly with a local squared loss by deriving the proximal operator, which we show has an efficient solution in the 2:4-sparse case. After optimizing the mask, we use masked-gradient updates to further minimize the local squared loss. We illustrate our method on toy problems and apply it to pruning entire large language models with up to 70B parameters. On models up to 13B parameters we improve over previous state-of-the-art algorithms, whilst on 70B models we match their performance.
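The paper's proximal operator additionally accounts for the correlation-based regularizer; the plain Euclidean projection onto the 2:4 constraint set, by contrast, is simple magnitude-based selection: in every group of 4 consecutive weights, keep the 2 largest in magnitude and zero the rest. A minimal NumPy sketch of that baseline projection (the function name is hypothetical, not from the paper):

```python
import numpy as np

def project_2_4(weights):
    """Euclidean projection onto the 2:4 sparsity constraint set:
    in each group of 4 consecutive weights, keep the 2 entries with
    the largest magnitude and zero out the other 2."""
    w = np.asarray(weights, dtype=float).reshape(-1, 4)  # groups of 4
    out = np.zeros_like(w)
    # column indices of the 2 largest-magnitude entries per group
    top2 = np.argsort(np.abs(w), axis=1)[:, -2:]
    rows = np.arange(w.shape[0])[:, None]
    out[rows, top2] = w[rows, top2]
    return out.reshape(np.shape(weights))

w = np.array([0.1, -2.0, 0.5, 3.0, 1.0, 0.2, -0.3, 0.9])
print(project_2_4(w))  # → [ 0.  -2.   0.   3.   1.   0.   0.   0.9]
```

This magnitude-only rule ignores feature correlations; the paper's contribution is precisely to fold a local-correlation regularizer into this projection step while keeping it analytically solvable.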