🤖 AI Summary
To address chain-of-thought degradation in large language models at test time, caused by distribution shift and the absence of verifiable supervision, this paper proposes SPINE, a token-selective (branch-point-aware) test-time reinforcement learning framework. Methodologically, SPINE (1) identifies high-entropy "forking" tokens from forward-pass statistics and restricts policy updates to these critical reasoning branch points; (2) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high under label-free conditions; and (3) combines self-consistency pseudo-rewards with a GRPO-style optimization objective and an optional KL anchor for stable policy refinement. Evaluated across ten diverse reasoning benchmarks, SPINE improves Pass@1 by 4.2% on average, avoids response-length collapse, and exhibits more stable training dynamics than existing test-time RL approaches on both LLM and MLLM backbones.
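As a rough illustration of step (1), the sketch below (not the authors' code) selects branch-point tokens by per-token predictive entropy from a single forward pass. The quantile threshold `rho` and the function name are assumptions for illustration; the paper's exact selection rule may differ.

```python
# Minimal sketch, assuming a top-quantile entropy criterion for "forking" tokens.
import torch
import torch.nn.functional as F

def forking_token_mask(logits: torch.Tensor, rho: float = 0.2) -> torch.Tensor:
    """logits: (batch, seq_len, vocab) from a forward pass over sampled
    trajectories. Returns a boolean (batch, seq_len) mask marking the
    top-`rho` fraction of positions by predictive entropy."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    # Per-sequence quantile threshold: only the highest-entropy positions
    # (the putative reasoning branch points) receive policy updates.
    thresh = torch.quantile(entropy, 1.0 - rho, dim=-1, keepdim=True)
    return entropy >= thresh
```

Restricting updates this way reflects the paper's diagnosis that most tokens are low-entropy "followers", so updating them uniformly mostly injects noise.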
📝 Abstract
Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward dominates, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. We therefore propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple, label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at https://github.com/JianghaoWu/SPINE.
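To make the token-selective objective concrete, here is a hedged sketch of how the masked update, entropy band, and KL anchor might fit together. The band limits (`low`, `high`), the penalty form, the loss weights, and the plain policy-gradient surrogate (standing in for GRPO's clipped-ratio objective) are all assumptions, not SPINE's verified formulation.

```python
# Minimal sketch, assuming a REINFORCE-style surrogate in place of GRPO's
# clipped-ratio objective and a hinge-shaped entropy-band penalty.
import torch

def spine_style_loss(logp, ref_logp, entropy, advantage, mask,
                     low=0.5, high=2.5, lam=1e-3, beta=1e-2):
    """logp, ref_logp, entropy: (batch, seq_len) per-token statistics;
    advantage: (batch,) group-normalized self-consistency pseudo-rewards
    (e.g., reward 1 if a trajectory's answer matches the majority vote,
    else 0, then standardized within the sampled group);
    mask: boolean forking-token mask from the selection step.
    Note: `entropy` must be computed from the live policy's logits so
    that the band term is differentiable."""
    adv = advantage.unsqueeze(-1)                  # broadcast over tokens
    pg = -(adv * logp)                             # policy-gradient surrogate
    # Entropy band: raise entropy when it drops below `low` (keep exploring)
    # and damp it above `high` (treat the pseudo-reward as too noisy there).
    band = torch.clamp(low - entropy, min=0.0) + torch.clamp(entropy - high, min=0.0)
    # Per-token KL to the anchor (reference) policy, k3-style estimator
    # commonly used in GRPO implementations.
    kl = torch.exp(ref_logp - logp) - (ref_logp - logp) - 1.0
    per_token = pg + lam * band + beta * kl
    # Confine the update to the selected branch-point tokens only.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

The design point the abstract emphasizes is the mask: both the pseudo-reward signal and the regularizer act only at the few branch-point tokens, which is what distinguishes this from TTRL's uniform sequence updates.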