TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM sparsification methods rely on post-hoc calibration or Hessian-based recovery, which prevents direct identification of deployable subnetworks. To address this, we propose TwIST, a distributed, end-to-end sparsification framework. TwIST dynamically samples and concurrently optimizes multiple subnetworks during training; periodic parameter aggregation and structural resampling enable direct discovery of high-quality "lottery-ticket" subnetworks, achieving zero-cost pruning at deployment. The resulting structured dense weights are hardware-agnostic, requiring no fine-tuning, calibration, or specialized sparse accelerators. Experiments show that at sparsity above 50%, TwIST reaches a perplexity of 23.14 on WikiText-2, substantially outperforming the closest prior method (31.64), while enabling efficient inference and memory compression.

📝 Abstract
We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
Problem

Research questions and friction points this paper is trying to address.

Enables zero-cost pruning for LLMs without post-training procedures
Identifies high-quality subnetworks under aggressive sparsity conditions
Produces structured sparse models for efficient inference on commodity hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains multiple subnetworks in parallel during training
Aggregates and resamples parameters to find golden tickets
Enables zero-cost pruning with structured dense matrices
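The training loop behind these contributions can be sketched in a few lines. The sketch below is an illustrative toy, not the paper's implementation: the mask-sampling rule, the number of local steps, the overlap-averaging aggregation, and all function names are assumptions, and the objective is a toy least-squares fit rather than LLM training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_structured_mask(shape, sparsity, rng):
    # Structured sparsity: zero out whole columns at the given rate
    # (a stand-in for pruning attention heads / MLP channels).
    n_cols = shape[1]
    keep = rng.choice(n_cols, size=int(n_cols * (1 - sparsity)), replace=False)
    mask = np.zeros(shape)
    mask[:, keep] = 1.0
    return mask

def local_step(w, mask, lr, grad_fn):
    # One masked SGD step: gradients flow only through active weights.
    g = grad_fn(w * mask) * mask
    return w - lr * g

def aggregate(local_weights, masks):
    # Average each parameter over the workers whose subnetwork contains it;
    # parameters active in no worker keep their previous value.
    num = sum(w * m for w, m in zip(local_weights, masks))
    den = sum(masks)
    return np.where(den > 0, num / np.maximum(den, 1), local_weights[0])

# Toy objective: fit W to a fixed target matrix T under subnetwork masks.
T = rng.normal(size=(4, 8))
grad_fn = lambda w: 2.0 * (w - T)  # gradient of ||W - T||^2

W = rng.normal(size=(4, 8))
K, sparsity, lr = 4, 0.5, 0.1          # 4 workers, 50% structured sparsity
for _ in range(50):                    # communication rounds
    masks = [sample_structured_mask(W.shape, sparsity, rng) for _ in range(K)]
    workers = [W.copy() for _ in range(K)]
    for _ in range(5):                 # independent local training
        workers = [local_step(w, m, lr, grad_fn) for w, m in zip(workers, masks)]
    W = aggregate(workers, masks)      # periodic parameter aggregation,
                                       # then masks are resampled next round

# Deployment: apply a final structured mask directly, with no recovery step.
final_mask = sample_structured_mask(W.shape, sparsity, rng)
W_sparse = W * final_mask
```

Because the kept columns form dense submatrices, the deployed model is an ordinary dense matrix multiply of smaller size, which is what makes the speedups available on commodity hardware without sparse kernels.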
Michael Menezes
Department of Computer Science, Rice University, Texas, USA
Barbara Su
Department of Computer Science, Rice University, Texas, USA
Xinze Feng
Department of Computer Science, Rice University, Texas, USA
Yehya Farhat
Department of Computer Science, Rice University, Texas, USA
H. Shili
Department of Computer Science, Rice University, Texas, USA
Anastasios Kyrillidis
Noah Harding CS Associate Professor, Ken Kennedy Institute Fellow, Dean Fellow, Rice University
Optimization · Nonconvex optimization · AI Agents · Continual Learning · Mixture of experts