SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address chain-of-thought degradation in large language models at test time, caused by distribution shift and the absence of verifiable supervision, this paper proposes SPINE, a token-selective test-time reinforcement learning framework. Methodologically, SPINE (1) identifies high-entropy forking tokens from forward-pass statistics, enabling selective policy updates at critical reasoning branch points; (2) introduces entropy-band regularization to balance exploration and noise suppression under label-free conditions; and (3) integrates self-consistency pseudo-rewards, a GRPO-style optimization objective, and optional KL-divergence anchoring for stable policy refinement. Evaluated across ten diverse reasoning benchmarks, SPINE achieves consistent Pass@1 improvements (+4.2% on average), avoids response-length collapse, and exhibits more stable training dynamics than existing test-time RL approaches.
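The forking-token selection step in the summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the top-fraction selection rule, the `top_frac` value, and all function names are hypothetical.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each position's next-token distribution.

    probs: (seq_len, vocab) array of softmax probabilities collected
    from a single forward pass over a sampled trajectory.
    """
    eps = 1e-12  # avoid log(0)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def forking_token_mask(probs: np.ndarray, top_frac: float = 0.2) -> np.ndarray:
    """Boolean mask over positions, selecting the highest-entropy
    'forking' tokens; only masked positions would receive policy updates.

    `top_frac` (an assumed hyperparameter) keeps that fraction of
    positions with the largest entropy.
    """
    ent = token_entropy(probs)
    k = max(1, int(round(top_frac * len(ent))))
    thresh = np.sort(ent)[-k]  # k-th largest entropy
    return ent >= thresh
```

In a GRPO-style objective, this mask would simply zero out the per-token loss at all non-forking positions, leaving low-entropy "follower" tokens untouched.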

📝 Abstract
Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test-time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus we propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at https://github.com/JianghaoWu/SPINE.
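The entropy-band regularizer described in the abstract (push entropy up when it falls below the band, push it down when it exceeds the band, do nothing inside it) can be sketched as a piecewise penalty. The band edges, quadratic shape, and coefficient below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def entropy_band_loss(ent: np.ndarray,
                      low: float = 0.5,
                      high: float = 2.5,
                      coef: float = 0.01) -> float:
    """Regularizer applied at forking tokens only.

    ent: per-token entropies at the selected positions.
    Below `low`, the penalty grows as entropy shrinks, which (when
    minimized) sustains exploration; above `high`, it grows with
    entropy, suppressing noisy supervision. Inside the band it is zero.
    All hyperparameters here are assumed for illustration.
    """
    below = np.maximum(low - ent, 0.0)   # active only under the band
    above = np.maximum(ent - high, 0.0)  # active only over the band
    return coef * float((below ** 2 + above ** 2).mean())
```

This term would be added to the token-selective GRPO-style loss, optionally alongside a KL anchor to the initial policy.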
Problem

Research questions and friction points this paper is trying to address.

Addresses test-time reinforcement learning collapse in reasoning models
Fixes uniform sequence updates that ignore high-entropy branch points
Prevents response-length collapse and Pass@1 decline during adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Updates only high-entropy forking tokens
Applies entropy-band regularization at those forking tokens
Adapts at test time without labels or reward models
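The label-free reward pipeline these bullets rely on (majority-vote pseudo-rewards over sampled trajectories, fed into a GRPO-style group-normalized advantage) can be sketched as follows; the function names and the 0/1 reward scheme are illustrative assumptions.

```python
from collections import Counter
import statistics

def self_consistency_rewards(answers: list[str]) -> list[float]:
    """Label-free pseudo-reward: 1.0 for trajectories whose final answer
    matches the majority vote over the sampled group, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each reward within its group,
    so no learned value function or reward model is needed."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sd for r in rewards]
```

For example, sampling four trajectories whose extracted answers are `["42", "42", "17", "42"]` yields rewards `[1, 1, 0, 1]`; after group normalization, the dissenting trajectory receives a negative advantage and the majority a positive one.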