Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of eliciting long chain-of-thought reasoning behaviors, such as backtracking and self-correction, in large language models (LLMs) without modifying their parameters. To this end, we propose ThinkLogit, a decoding-time logit arithmetic intervention that uses a small pre-trained reasoning model (e.g., R1-Distill-Qwen-1.5B) as a "reasoning guide" to dynamically rectify the output logits of a large non-reasoning LLM (e.g., Qwen2.5-32B). ThinkLogit requires no fine-tuning of the target LLM, enables cross-architecture transfer, and is orthogonal to post-training of the guider. We further train the guiding model with preference optimization (DPO) over correct/incorrect reasoning pairs, yielding the ThinkLogit-DPO framework. On five reasoning benchmarks, ThinkLogit and ThinkLogit-DPO improve the average accuracy of Qwen2.5-32B by 24.5% and 29.1% (relative), respectively, showing that complex long-form reasoning can be unlocked in large models at decoding time without any training of the target model.

📝 Abstract
Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which uses logit arithmetic to steer a large non-reasoning target model toward long reasoning, with a substantially smaller reasoning model serving as the guider. We then show that performance can be further boosted by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider models, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve relative improvements in average accuracy of 24.5% and 29.1%, respectively, across five reasoning benchmarks, using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models: guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
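The abstract does not spell out the exact decoding rule. A common form of logit arithmetic (in the style of proxy-tuning) adds the difference between a small reasoning model's logits and its non-reasoning base model's logits to the large target model's logits at every step. The sketch below illustrates that idea; the function name, the toy logits, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation:

```python
def think_logit_step(target_logits, guider_reason_logits, guider_base_logits, alpha=1.0):
    """One greedy decoding step via logit arithmetic (illustrative sketch).

    target_logits:        next-token logits from the large non-reasoning model
                          (e.g., Qwen2.5-32B)
    guider_reason_logits: logits from the small reasoning guider
                          (e.g., R1-Distill-Qwen-1.5B)
    guider_base_logits:   logits from the guider's non-reasoning base model
    alpha:                guidance strength (assumed hyperparameter)
    """
    # Shift the large model's logits by the guider's "reasoning delta".
    combined = [t + alpha * (r - b)
                for t, r, b in zip(target_logits, guider_reason_logits, guider_base_logits)]
    # Greedy argmax for illustration; sampling would apply softmax to `combined`.
    return max(range(len(combined)), key=combined.__getitem__)

# Toy 4-token vocabulary: the guider's delta flips the choice from token 0 to token 2.
print(think_logit_step([2.0, 0.5, 1.5, 0.0],   # target alone would pick token 0
                       [0.0, 0.0, 2.0, 0.0],   # reasoning guider favors token 2
                       [0.0, 0.0, 0.0, 0.0]))  # guider base is flat
```

Because all three models must share a vocabulary for the addition to be well defined, cross-family transfer (mentioned in the abstract) presumably requires tokenizer alignment.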
Problem

Research questions and friction points this paper is trying to address.

Eliciting long chain-of-thought reasoning without additional training
Improving large non-reasoning models using smaller reasoning guides
Enhancing reasoning accuracy across different model families and benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Logit arithmetic enables reasoning without training
Small reasoning model guides large non-reasoning model
Preference optimization boosts performance with reasoning pairs
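The preference step pairs correct (chosen) and incorrect (rejected) reasoning traces sampled from the target and guider models. A standard DPO objective on per-trace log-probabilities looks like the following sketch; the function signature, `beta`, and the toy values are illustrative assumptions, and the paper's exact training setup may differ:

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each reasoning trace under the
    policy (the guider being trained) and a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # trace over the rejected one, relative to the reference model.
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is 0 and the loss is log 2.
print(round(dpo_loss(-11.0, -11.0, -11.0, -11.0), 4))  # → 0.6931
```

Minimizing this loss pushes the guider to assign higher probability to correct traces than the reference does, which then propagates to the large target model through the logit-arithmetic combination at decoding time.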