Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and overfitting risks of full-parameter fine-tuning in large language models, as well as the limited inference speedup offered by existing parameter-efficient adaptation methods such as LoRA. The authors propose a structured sparse fine-tuning mechanism based on learnable stochastic gating, which directly induces row- and column-level sparsity in model weights during training, replacing conventional weight updates. Requiring only a small number of trainable parameters, the method maintains competitive performance while removing 20%–40% of the original model parameters, thereby significantly reducing inference latency. Theoretical analysis provides convergence guarantees and demonstrates improved conditioning of the optimization landscape. Experimental results show that the proposed approach outperforms mainstream low-rank adaptation baselines in both fine-tuning efficiency and inference speed.

📝 Abstract
Fully finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. Although methods like low-rank adapters help address these challenges by adding small trainable modules to the frozen LM, they also increase memory usage and do not reduce inference latency. We uncover an intriguing phenomenon: sparsifying specific model rows and columns enables efficient task adaptation without requiring weight tuning. We propose a scheme for effective finetuning via sparsification using trainable stochastic gates, which requires minimal trainable parameters, reduces inference time, and removes 20–40% of model parameters without significant accuracy loss. Empirical results show it outperforms recent finetuning baselines in efficiency and performance. Additionally, we provide theoretical guarantees for the convergence of this stochastic gating process, and show that our method admits a simpler and better-conditioned optimization landscape compared to LoRA. Our results highlight sparsity as a compelling mechanism for task-specific adaptation in LMs.
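The row- and column-gating idea from the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming Gaussian-based stochastic gates (in the spirit of common stochastic-gate relaxations of Bernoulli masks); the gate parameterization, shapes, and the 0.5 keep-threshold are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def stochastic_gate(mu, sigma=0.5, rng=None):
    """Relaxed Bernoulli gate: z = clip(mu + sigma * eps, 0, 1), eps ~ N(0, 1).

    mu are the trainable gate parameters; sigma controls the noise used
    during training so the expected sparsity is differentiable in mu.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return np.clip(mu + sigma * eps, 0.0, 1.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))       # frozen pretrained weight matrix (illustrative size)
mu_row = np.linspace(0.1, 0.9, 8)      # stand-ins for trained row-gate parameters
mu_col = np.linspace(0.0, 1.0, 16)     # stand-ins for trained column-gate parameters

# Training-time forward pass: rows and columns of the frozen W are
# modulated by sampled gates; only mu_row and mu_col are trainable.
z_row = stochastic_gate(mu_row, rng=rng)
z_col = stochastic_gate(mu_col, rng=rng)
W_sparse = z_row[:, None] * W * z_col[None, :]

# Inference: gates become deterministic, and rows/columns whose gates are
# closed are physically removed, shrinking the matrix and the latency.
keep_rows = np.clip(mu_row, 0.0, 1.0) > 0.5
keep_cols = np.clip(mu_col, 0.0, 1.0) > 0.5
W_pruned = W[keep_rows][:, keep_cols]
print(W.shape, "->", W_pruned.shape)
```

With these illustrative gate values, half of the rows and columns fall below the threshold, so the pruned matrix is genuinely smaller, which is the source of the inference speedup that additive adapters such as LoRA do not provide.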
Problem

Research questions and friction points this paper is trying to address.

model finetuning
structured sparsity
inference efficiency
parameter compression
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured sparsity
stochastic gating
parameter-efficient finetuning
model compression
inference acceleration