🤖 AI Summary
This work addresses a limitation of current large reasoning models: they do not explicitly account for the diversity of reasoning paths or the transitions between reasoning steps in complex mathematical reasoning, which constrains their reasoning capability. The paper introduces the concept of “thinking schemata,” which covers both reasoning transitions and answer candidates, and treats their diversity as something to be enhanced during both training and inference. On this basis, the authors propose DiScO (Diverse Schemata Policy Optimization), a framework that combines schemata awareness, reinforcement learning that encourages diversity, and diverse reasoning at inference time. Experiments show that DiScO consistently outperforms standard group relative policy optimization across multiple mathematical reasoning benchmarks, and human-annotated analyses indicate that it substantially improves the model's ability to recover from erroneous initial attempts.
📝 Abstract
Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of the reasoning process: reasoning transitions, which capture the distinct transitions between reasoning steps, and answer candidates, which reflect the variety of solution paths produced by the model. We refer to these two aspects collectively as thinking schemata. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to enhance this diversity as a means of improving reasoning ability. To this end, we propose Diverse Schemata Policy Optimization (DiScO), a framework that first endows the model with schemata awareness, then encourages diversity through reinforcement learning, and finally promotes diverse reasoning at inference time. Experiments on multiple mathematical reasoning benchmarks demonstrate that DiScO consistently outperforms standard group relative policy optimization. Beyond accuracy, human-annotated analyses show that DiScO substantially improves the model's ability to recover from erroneous initial attempts. Overall, our work highlights the important role of diversity in thinking schemata and points to scaling along the diversity dimension as a promising research direction.
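The abstract does not say how the diversity of thinking schemata is measured. As a purely illustrative sketch, one simple way to quantify the diversity of answer candidates is to count the distinct candidates and compute a normalized Shannon entropy over them. The function name `answer_diversity`, the choice of entropy, and the assumption that candidates have already been extracted as strings are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def answer_diversity(answers):
    """Return (distinct_count, normalized_entropy) for a list of answer candidates.

    Assumption: `answers` holds candidate answers already extracted from a
    model's reasoning as strings. This is an illustrative measure only,
    not the metric used by DiScO.
    """
    if not answers:
        raise ValueError("answers must be non-empty")

    counts = Counter(answers)
    n = len(answers)

    # Shannon entropy of the empirical distribution over candidates.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())

    # The entropy is at most log(n), reached when every candidate is distinct.
    max_entropy = math.log(n) if n > 1 else 0.0
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    return len(counts), normalized


if __name__ == "__main__":
    print(answer_diversity(["12", "12", "15", "18"]))
```

Under this measure, a set of identical candidates scores 0 and a set of all-distinct candidates scores 1, so it could serve as a rough proxy when checking whether diversity correlates with accuracy.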