Representation-Based Exploration for Language Models: From Test-Time to Post-Training

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether reinforcement learning (RL) can enable language models to discover *novel* behaviors, rather than merely amplify capabilities already acquired during pretraining. To this end, the paper proposes a hidden-state-based exploration mechanism: semantic similarity is computed from intermediate-layer representations of the pretrained model to construct a diversity-oriented intrinsic reward, which then guides both inference-time exploration and RL post-training. The method requires no architectural modifications and significantly improves behavioral diversity and sample efficiency. Experiments show over a 50% improvement in verifier efficiency with Qwen-2.5-14B-Instruct at inference time, and a threefold improvement in test-time sample efficiency on the AIME 2024 benchmark after post-training Qwen-2.5-7B-Instruct. The core contribution is the use of hidden-state representations to formulate a generalizable intrinsic exploration bonus, unifying behavior discovery across both the inference and training phases.
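The paper does not publish its exact formula here, but the idea of a diversity-oriented intrinsic reward from hidden states can be sketched as follows. Assuming each candidate generation has already been reduced to a single intermediate-layer representation vector (e.g. a pooled hidden state), one simple instantiation scores each candidate by how dissimilar it is, on average, from the other candidates. The function name `diversity_bonus` and the reward-mixing coefficient `beta` are illustrative, not taken from the paper.

```python
import numpy as np

def diversity_bonus(hidden_states: np.ndarray) -> np.ndarray:
    """Per-candidate diversity bonus: 1 minus the mean cosine similarity
    to the *other* candidates' hidden-state representations.

    hidden_states: array of shape (n_candidates, hidden_dim).
    Returns an array of shape (n_candidates,); higher means more novel.
    """
    # L2-normalise each candidate's representation vector.
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / np.clip(norms, 1e-8, None)
    sim = unit @ unit.T                       # pairwise cosine similarities
    n = hidden_states.shape[0]
    # Average similarity to the other candidates (exclude self-similarity).
    mean_sim = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    return 1.0 - mean_sim

# Toy example: two near-duplicate candidates and one distinct one.
reps = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
bonus = diversity_bonus(reps)
# A combined RL objective could then use: task_reward + beta * bonus
```

In an RL pipeline this bonus would be added to the task reward with a small weight, encouraging the policy to spread probability mass over semantically distinct trajectories rather than many paraphrases of one solution.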

📝 Abstract
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates -- both for post-training, and in a novel inference-time scaling setting we introduce. For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on almost all tasks. For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration -- with the right notion of diversity -- is a practical path toward discovery of new behaviors beyond sharpening.
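For the inference-time scaling setting, one way the representation-based diversity notion could be used is to pick which candidates to send to the verifier: rather than verifying samples in arrival order, greedily select a subset whose hidden-state representations are maximally spread out. The greedy farthest-point scheme below is a hedged sketch of this idea, not the paper's exact procedure; `select_diverse` and its seeding choice are assumptions.

```python
import numpy as np

def select_diverse(reps: np.ndarray, k: int) -> list:
    """Greedily pick k candidate indices whose representations are
    maximally spread out: each new pick minimises its maximum cosine
    similarity to the already-selected set.

    reps: array of shape (n_candidates, hidden_dim), k <= n_candidates.
    """
    unit = reps / np.clip(np.linalg.norm(reps, axis=1, keepdims=True),
                          1e-8, None)
    sim = unit @ unit.T
    chosen = [0]                      # seed with an arbitrary candidate
    while len(chosen) < k:
        # For each candidate, its closest similarity to the chosen set.
        max_sim = sim[:, chosen].max(axis=1)
        max_sim[chosen] = np.inf      # never re-pick a chosen index
        chosen.append(int(np.argmin(max_sim)))
    return chosen
```

Verifying a diverse subset first is one plausible mechanism behind the reported verifier-efficiency gains: near-duplicate samples add little pass@k coverage, so spending verifier calls on distinct candidates raises the chance that some verified sample is correct.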
Problem

Research questions and friction points this paper is trying to address.

Enhancing language model diversity through representation-based exploration
Improving reasoning efficiency via inference-time exploration strategies
Boosting post-training performance with principled RL exploration methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation-based bonus from hidden states
Exploration improves diversity and pass rates
Applies to both post-training and inference-time scaling