Hybrid Latent Reasoning via Reinforcement Learning

📅 2025-05-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Latent reasoning in large language models (LLMs) suffers from a fundamental mismatch with autoregressive token generation, which is inherently discrete, while existing approaches rely heavily on manually annotated chain-of-thought (CoT) trajectories. Method: We propose a hybrid latent reasoning framework based on a PPO variant of reinforcement learning. It introduces a learnable gating mechanism that dynamically fuses discrete tokens with continuous hidden states, enabling end-to-end training without CoT supervision. Stochastic token sampling provides the exploration needed for RL optimization, and the framework supports progressive feature fusion and latent state reweighting. Contribution/Results: Our method consistently outperforms prior latent reasoning approaches on knowledge- and reasoning-intensive benchmarks, generating shorter, more coherent, and more interpretable responses. It also demonstrates strong cross-lingual generalization, validating its robustness and scalability beyond monolingual settings.
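The summary's claim that stochastic sampling supplies exploration can be illustrated with a single decoding step. The sketch below is hypothetical (the function name `hybrid_rollout_step` and the scalar `gate_alpha` are illustrative, not from the paper): a token is sampled from the policy's logits, keeping the rollout stochastic, and its embedding is mixed with the previous continuous hidden state before being fed back to the model.

```python
import torch

def hybrid_rollout_step(logits, prev_hidden, embed, gate_alpha):
    # Hypothetical single decoding step for hybrid latent reasoning.
    # Sampling (rather than argmax) keeps the policy stochastic,
    # which provides exploration for RL optimization.
    probs = torch.softmax(logits, dim=-1)
    token = torch.multinomial(probs, num_samples=1).squeeze(-1)  # discrete sample
    tok_emb = embed(token)                                       # discrete branch
    # Fuse the token embedding with the continuous hidden state;
    # gate_alpha stands in for the paper's learnable gate.
    fused = gate_alpha * tok_emb + (1 - gate_alpha) * prev_hidden
    return token, fused
```

Because the fused vector, not the raw token embedding, becomes the next-step input, the model carries richer continuous features forward while still emitting discrete tokens that RL can score.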

📝 Abstract
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
Problem

Research questions and friction points this paper is trying to address.

Bridging latent reasoning with autoregressive LLMs via RL
Enabling hybrid reasoning using discrete and continuous representations
Optimizing latent reasoning without requiring CoT training traces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid reasoning via reinforcement learning
Learnable gating mechanism for hidden states
Progressive training with token embeddings
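The gating and progressive-training ideas listed above can be sketched together: a per-dimension learnable gate mixes the token embedding with the prior hidden state, and a large positive initial bias makes the gate start near 1, so training begins with predominantly token embeddings and can shift toward hidden features as the gate is learned. This is a minimal sketch, assuming a sigmoid gate and the illustrative class name `HybridGate`; the paper's exact parameterization may differ.

```python
import torch

class HybridGate(torch.nn.Module):
    # Hypothetical sketch of a learnable gate for hybrid latent reasoning.
    def __init__(self, hidden_dim, init_bias=4.0):
        super().__init__()
        # Large positive bias => sigmoid(gate) ~ 1 at initialization,
        # so early training uses predominantly token embeddings
        # (the "progressive fusion" idea).
        self.gate = torch.nn.Parameter(torch.full((hidden_dim,), init_bias))

    def forward(self, token_emb, prev_hidden):
        alpha = torch.sigmoid(self.gate)  # per-dimension mixing weight in (0, 1)
        return alpha * token_emb + (1 - alpha) * prev_hidden
```

Because `alpha` is a trained parameter rather than a fixed schedule, the model itself decides how much continuous hidden-state information to incorporate, dimension by dimension.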