TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM sparsification methods rely on post-hoc calibration or Hessian-based recovery, which prevents direct identification of deployable subnetworks. To address this, we propose TwIST, a distributed, end-to-end sparsification framework. TwIST dynamically samples and concurrently optimizes multiple subnetworks during training; periodic parameter aggregation and structural resampling enable direct discovery of high-quality "lottery-ticket" subnetworks, achieving zero-cost pruning at deployment. The resulting structured dense weights are hardware-agnostic, requiring no fine-tuning, calibration, or specialized sparse accelerators. Experiments show that at sparsity above 50%, TwIST reaches a perplexity of 23.14 on WikiText-2, substantially outperforming the closest prior method (31.64), while enabling efficient inference and memory compression.

📝 Abstract
We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
Problem

Research questions and friction points this paper is trying to address.

Enables zero-cost pruning for LLMs without post-training procedures
Identifies high-quality subnetworks under aggressive sparsity conditions
Produces structured sparse models for efficient inference on commodity hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains multiple subnetworks in parallel during training
Aggregates and resamples parameters to find golden tickets
Enables zero-cost pruning with structured dense matrices
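The training loop behind these contributions can be sketched in a few lines. The sketch below is an illustrative toy, not the paper's implementation: the mask-sampling rule, the number of local steps, the overlap-averaging aggregation, and all function names are assumptions, and the objective is a toy least-squares fit rather than LLM training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_structured_mask(shape, sparsity, rng):
    # Structured sparsity: zero out whole columns at the given rate
    # (a stand-in for pruning attention heads / MLP channels).
    n_cols = shape[1]
    keep = rng.choice(n_cols, size=int(n_cols * (1 - sparsity)), replace=False)
    mask = np.zeros(shape)
    mask[:, keep] = 1.0
    return mask

def local_step(w, mask, lr, grad_fn):
    # One masked SGD step: gradients flow only through active weights.
    g = grad_fn(w * mask) * mask
    return w - lr * g

def aggregate(local_weights, masks):
    # Average each parameter over the workers whose subnetwork contains it;
    # parameters active in no worker keep their previous value.
    num = sum(w * m for w, m in zip(local_weights, masks))
    den = sum(masks)
    return np.where(den > 0, num / np.maximum(den, 1), local_weights[0])

# Toy objective: fit W to a fixed target matrix T under subnetwork masks.
T = rng.normal(size=(4, 8))
grad_fn = lambda w: 2.0 * (w - T)  # gradient of ||W - T||^2

W = rng.normal(size=(4, 8))
K, sparsity, lr = 4, 0.5, 0.1          # 4 workers, 50% structured sparsity
for _ in range(50):                    # communication rounds
    masks = [sample_structured_mask(W.shape, sparsity, rng) for _ in range(K)]
    workers = [W.copy() for _ in range(K)]
    for _ in range(5):                 # independent local training
        workers = [local_step(w, m, lr, grad_fn) for w, m in zip(workers, masks)]
    W = aggregate(workers, masks)      # periodic parameter aggregation,
                                       # then masks are resampled next round

# Deployment: apply a final structured mask directly, with no recovery step.
final_mask = sample_structured_mask(W.shape, sparsity, rng)
W_sparse = W * final_mask
```

Because the kept columns form dense submatrices, the deployed model is an ordinary dense matrix multiply of smaller size, which is what makes the speedups available on commodity hardware without sparse kernels.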
Michael Menezes
Department of Computer Science, Rice University, Texas, USA
Barbara Su
Department of Computer Science, Rice University, Texas, USA
Xinze Feng
Department of Computer Science, Rice University, Texas, USA
Yehya Farhat
Department of Computer Science, Rice University, Texas, USA
H. Shili
Department of Computer Science, Rice University, Texas, USA
Anastasios Kyrillidis
Noah Harding CS Associate Professor, Ken Kennedy Institute Fellow, Dean Fellow, Rice University
Optimization · Nonconvex optimization · AI Agents · Continual Learning · Mixture of experts