LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-stage reasoning enhances small language models (SLMs) but incurs substantial latency; existing adaptive acceleration techniques—e.g., layer skipping—struggle to balance efficiency and accuracy due to stage-dependent sensitivity to skipping and redundant token generation. This paper proposes LiteStage, the first framework integrating stage-aware layer skipping with confidence-based generation early exit. It employs offline stage-level layer budget allocation to optimize skipping policies and online confidence-guided early termination to suppress redundant decoding—all without fine-tuning, enabling efficient, lightweight deployment. Evaluated on OBQA, CSQA, and StrategyQA, LiteStage achieves up to 1.70× speedup with ≤4.0% accuracy degradation, significantly outperforming state-of-the-art training-free layer-skipping methods.

📝 Abstract
Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
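The online half of the method, confidence-based generation early exit, can be illustrated with a minimal greedy-decoding sketch: decoding stops as soon as the model's confidence in its next token crosses a threshold. This is a hypothetical simplification (the function name, the max-softmax confidence measure, and the threshold are assumptions for illustration, not the paper's exact criterion):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate_with_early_exit(step_logits, threshold=0.9):
    """Greedy decoding that halts once next-token confidence exceeds
    `threshold` -- a simplified stand-in for LiteStage's confidence-based
    generation early exit (assumed interface, illustrative only)."""
    out = []
    for logits in step_logits:          # one logit vector per decode step
        probs = softmax(logits)
        conf = max(probs)
        out.append(probs.index(conf))   # greedy pick
        if conf >= threshold:           # confident enough: stop decoding
            break
    return out
```

With a high threshold the loop consumes every step; with a moderate one it terminates early on the first confident prediction, which is how redundant trailing tokens would be suppressed.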
Problem

Research questions and friction points this paper is trying to address.

Reducing latency in multi-stage reasoning models
Addressing stage-wise variation in skip sensitivity
Minimizing generation of redundant output tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage-wise offline search allocates optimal layer budgets
Online confidence-based generation enables early exit
Latency-aware layer skipping accelerates multi-stage reasoning
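The offline half, stage-wise layer-budget allocation, can be sketched as a greedy search: for each stage, try progressively smaller layer budgets on a dev set and keep the smallest one whose accuracy drop stays within a tolerance. The function names, the per-stage greedy order, and the `evaluate` callback are assumptions for illustration; the paper's actual search procedure may differ:

```python
def allocate_layer_budgets(stages, candidate_budgets, evaluate, tolerance=0.02):
    """Greedy offline search over per-stage layer budgets.

    `evaluate(budgets)` is a caller-supplied function returning dev-set
    accuracy for a mapping {stage: kept-layer fraction}. Hypothetical
    sketch of a stage-wise budget search, not LiteStage's exact algorithm.
    """
    budgets = {s: 1.0 for s in stages}           # start at full depth
    baseline = evaluate(budgets)
    for stage in stages:                          # search one stage at a time
        best = budgets[stage]
        for b in sorted(candidate_budgets):       # try the smallest budget first
            trial = dict(budgets)
            trial[stage] = b
            if baseline - evaluate(trial) <= tolerance:
                best = b                          # smallest budget within tolerance
                break
        budgets[stage] = best
    return budgets
```

Because stages differ in skip sensitivity, such a search would naturally assign a fuller layer budget to sensitive stages and a smaller one to robust stages.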