🤖 AI Summary
Large language models (LLMs) exhibit limited reasoning capabilities on complex mathematical tasks. Method: We propose SPRINT, a contrastive learning–based framework for dynamic, structured pruning of attention heads. During inference, SPRINT adaptively selects optimal head-layer combinations to enable task-driven Best-of-N path optimization. Contribution/Results: We empirically demonstrate—contrary to conventional wisdom—that selective head pruning can enhance, rather than degrade, LLM reasoning performance. SPRINT introduces attention-head embedding alignment and dynamic pruning to preserve semantic consistency and path validity post-pruning. Evaluated on MATH500 and GSM8K, SPRINT significantly outperforms standard Best-of-N sampling and random pruning, achieving up to a 4.2-percentage-point absolute accuracy gain. This validates that dynamically sparse inference paths effectively augment LLMs’ capacity for complex mathematical reasoning.
📝 Abstract
Model pruning in transformer-based language models is traditionally viewed as a means of achieving computational savings, yet it can also enhance a model's reasoning capabilities. In this work, we uncover a surprising phenomenon: selectively pruning certain attention heads improves reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, SPRINT identifies the pruned-head configurations that yield more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-$N$ and random head-selection strategies on the MATH500 and GSM8K datasets.
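The head-selection step described in the abstract (aligning a question embedding with learned per-head embeddings to pick which head to prune) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `select_head_to_prune`, the cosine-similarity scoring rule, and the toy embedding dimensions are all assumptions.

```python
import numpy as np

def select_head_to_prune(q_emb, head_embs):
    """Pick the (layer, head) whose learned embedding best aligns with the
    question embedding, here scored by cosine similarity (an assumption;
    the paper's contrastive objective may use a different score).

    head_embs: dict mapping (layer, head) -> embedding vector.
    Returns the (layer, head) key with the highest alignment score.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(head_embs, key=lambda k: cos(q_emb, head_embs[k]))

# Toy example: 2 layers x 2 heads with 4-dimensional embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=4)                                   # question embedding
heads = {(l, h): rng.normal(size=4)                      # learned head embeddings
         for l in range(2) for h in range(2)}
layer, head = select_head_to_prune(q, heads)
```

At inference time, the selected head-layer configuration would be pruned before generating each of the best-of-$N$ candidate paths, so the sampled reasoning paths reflect the task-conditioned sparse model rather than the dense one.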