🤖 AI Summary
This work addresses a limitation of reinforcement learning with sparse scalar rewards: such rewards rarely show a large language model how to correct erroneous reasoning, and they can encourage shortcut learning. The authors propose SocraticPO, a framework that integrates Socratic-style natural-language guidance into policy optimization. During rollout, a student model first answers independently; if the answer is incorrect, a stronger black-box teacher model diagnoses the attempt and provides a concise corrective hint, and the student continues with that hint in its context. Correct answers reached after a hint receive only a decayed reward, which incentivizes the student to answer correctly on its own. Because SocraticPO changes only the rollout process and leaves the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong reinforcement learning and self-distillation baselines, and ablation studies indicate that both the guidance and the reward decay components are necessary.
📝 Abstract
Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.
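The rollout described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the decay schedule or the number of guidance rounds, so the geometric decay `decay ** hints_used`, the `max_hints` limit, and the names `socratic_rollout`, `Rollout`, `student`, `teacher`, and `is_correct` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    context: List[str]  # question, student attempts, and teacher hints, in order
    reward: float       # scalar reward that would be passed to the policy-gradient backend
    hints_used: int     # number of teacher interventions before the episode ended


def socratic_rollout(
    question: str,
    student: Callable[[List[str]], str],   # hypothetical: maps context to an answer
    teacher: Callable[[List[str]], str],   # hypothetical: black-box, returns text only
    is_correct: Callable[[str], bool],     # binary correctness check
    max_hints: int = 2,                    # assumed limit, not given in the abstract
    decay: float = 0.5,                    # assumed geometric decay factor
) -> Rollout:
    context = [question]
    for hints_used in range(max_hints + 1):
        # The student answers under the current (possibly expanded) context.
        answer = student(context)
        context.append(answer)
        if is_correct(answer):
            # An unassisted correct answer earns 1.0; each hint used shrinks the reward.
            return Rollout(context, decay ** hints_used, hints_used)
        if hints_used < max_hints:
            # The teacher sees the failed attempt and adds a text-level hint.
            context.append(teacher(context))
    # Still incorrect after all allowed hints: no reward.
    return Rollout(context, 0.0, max_hints)


if __name__ == "__main__":
    # Toy student that only gets the answer right once a hint is in its context.
    toy_student = lambda ctx: "4" if any(m.startswith("HINT") for m in ctx) else "5"
    toy_teacher = lambda ctx: "HINT: recheck the addition step."
    result = socratic_rollout("What is 2 + 2?", toy_student, toy_teacher, lambda a: a == "4")
    print(result.reward, result.hints_used)
```

In this toy run the student succeeds only after one hint, so it earns the decayed reward of 0.5 rather than the full 1.0. Only the reward assigned to the trajectory changes; the optimizer downstream still maximizes expected reward, which is why the abstract can describe the method as compatible with existing policy-gradient backends. The teacher is called purely as a text-in, text-out function, matching the claim that no logits are required.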