CATPO: Critique-Augmented Tree Policy Optimization

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an inefficiency in tree-based reinforcement learning: many sampled reasoning trees carry little training signal, which wastes compute and slows training. The authors propose a tree-level informativeness score and pair it with critique-guided repair of trees in which every branch fails and with an informativeness-weighted loss, improving training efficiency while preserving total gradient magnitude. The approach combines tree-structured trajectory sampling, natural-language critique generation, and policy-reward decorrelation analysis to weight policy updates. With Qwen2.5-Math-1.5B, the method reaches a macro accuracy of 37.5% across four mathematical reasoning benchmarks, outperforming TreeRPO and GRPO by 1.9% and 4.8%, respectively.

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, where all leaves fail, or where the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), which combines leaf-outcome diversity with policy-reward decorrelation and requires no extra compute. For dead-wrong trees, where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained on the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.
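
The scoring and weighting steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of F(T), so everything below is an assumption made for illustration: binary leaf rewards, diversity measured as 4p(1 - p), decorrelation measured as (1 - r) / 2 for the Pearson correlation r between leaf log-probabilities and rewards, a product to combine the two terms, and weights normalized to have mean 1 so that the average loss scale is unchanged. The function names (`leaf_diversity`, `decorrelation`, `tree_score`, `normalized_weights`, `weighted_loss`) are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def leaf_diversity(rewards):
    """Outcome diversity for binary leaf rewards, scaled to [0, 1].

    Returns 0 when every leaf succeeds or every leaf fails (assumed form).
    """
    p = mean(rewards)
    return 4.0 * p * (1.0 - p)


def decorrelation(logprobs, rewards):
    """Map the Pearson correlation r between leaf log-probs and rewards
    to (1 - r) / 2, so a policy that already ranks leaves correctly scores 0.
    """
    sx, sy = pstdev(logprobs), pstdev(rewards)
    if sx == 0.0 or sy == 0.0:
        return 0.5  # correlation undefined; treated as uncorrelated (assumption)
    mx, my = mean(logprobs), mean(rewards)
    cov = mean([(x - mx) * (y - my) for x, y in zip(logprobs, rewards)])
    r = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp against rounding error
    return (1.0 - r) / 2.0


def tree_score(logprobs, rewards):
    """Assumed stand-in for F(T): product of diversity and decorrelation."""
    return leaf_diversity(rewards) * decorrelation(logprobs, rewards)


def normalized_weights(scores):
    """Scale scores to mean 1; fall back to uniform weights if all are 0."""
    m = mean(scores)
    if m == 0.0:
        return [1.0] * len(scores)
    return [s / m for s in scores]


def weighted_loss(tree_losses, scores):
    """Average of per-tree losses, each scaled by its normalized score."""
    weights = normalized_weights(scores)
    return mean([w * l for w, l in zip(weights, tree_losses)])
```

Under these assumptions, a tree whose leaves all succeed or all fail receives a score of 0 and drops out of the weighted loss, while the mean-1 normalization keeps the weights summing to the number of trees in the batch. Both quantities are computed from leaf rewards and log-probabilities that the rollout already produces, which is one way to read the "no extra compute" claim.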

Problem

Research questions and friction points this paper is trying to address.

tree-based reinforcement learning

uninformative trees

compute waste

reward signal inefficiency

policy optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Critique-Augmented

Tree Informativeness Scoring

Policy-Reward Decorrelation