🤖 AI Summary
To address inconsistency and insufficient robustness in large language model (LLM) reasoning, this paper proposes a cohort-level consistency reinforcement learning framework, CC-Learn. It generates cohorts of semantically similar questions via programmatic abstraction and jointly optimizes cohort-level accuracy, a retrieval-augmented gain, and a rejection penalty to enforce uniform reasoning paths across the questions in each cohort. The paper introduces the first cohort-level consistency modeling paradigm and designs an end-to-end multi-objective composite reward that reinforcement learning can optimize directly, overcoming the limitation of supervised fine-tuning in capturing cross-sample reasoning constraints. Built on the PPO algorithm and retrieval-augmented reasoning, the method achieves significant improvements in both accuracy and reasoning stability on ARC-Challenge and StrategyQA, consistently outperforming pretrained and supervised fine-tuned baselines.
📝 Abstract
Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective, one that reinforcement learning can directly optimize unlike supervised fine-tuning, combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
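The composite objective described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how the three terms are weighted or aggregated, so the linear combination and the `alpha`/`beta` hyperparameters below are illustrative assumptions.

```python
# Hedged sketch of a cohort-level composite reward, assuming the three
# terms (cohort accuracy, retrieval bonus, rejection penalty) combine
# linearly; alpha and beta are hypothetical weights, not from the paper.

def cohort_reward(correct, retrieval_gain, rejected, alpha=0.5, beta=1.0):
    """Composite reward for one cohort of similar questions.

    correct        -- list[bool], per-question answer correctness
    retrieval_gain -- list[float], per-question bonus for effective
                      problem decomposition via retrieval
    rejected       -- list[bool], True when a lookup was trivial/invalid
    """
    n = len(correct)
    accuracy = sum(correct) / n              # cohort accuracy term
    bonus = alpha * sum(retrieval_gain) / n  # retrieval bonus term
    penalty = beta * sum(rejected) / n       # rejection penalty term
    return accuracy + bonus - penalty
```

A PPO loop would assign this scalar as the shared reward for every rollout in the cohort, which is what pushes the policy toward uniform reasoning patterns across cohort members.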