Training Language Model to Critique for Better Refinement

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing work lacks a systematic investigation into which types of critique most effectively improve model response quality. To address this, the paper proposes Refinement-oriented Critique Optimization (RCO), a framework that trains critic models explicitly to produce critiques that lead to better refined responses. RCO introduces Critique Utility (CU), an automated reward signal that quantifies how much a critique actually improves the refined response. By closing the critique-refinement loop and supervising the critic with preferences over refined responses, RCO avoids the need to judge critiques directly. Empirically, RCO improves the instructiveness and practical utility of critique models, outperforming strong baselines across five tasks (dialogue generation, summarization, question answering, mathematical reasoning, and code generation) in both critique quality and the magnitude of response improvement.
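As a rough formalization of the summary above (the notation below is an assumption for illustration, not taken from the paper), Critique Utility can be read as the expected quality gain of the refined response over the initial one:

```latex
% Hypothetical formalization of Critique Utility (CU); notation assumed, not the authors'.
% x: prompt, y: initial response, c: critique, y': refined response,
% \pi_{\mathrm{actor}}: actor refinement policy, r: response-quality scorer.
\mathrm{CU}(c \mid x, y)
  = \mathbb{E}_{y' \sim \pi_{\mathrm{actor}}(\cdot \mid x, y, c)}
    \bigl[\, r(x, y') - r(x, y) \,\bigr]
```

Under this reading, a critique is rewarded exactly in proportion to the improvement it induces in the actor's refinement.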

📝 Abstract
Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e., dialog generation, summarization, question answering, mathematical reasoning, and code generation, and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method's effectiveness in enhancing LLM critique-refinement loops.
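To make the training loop concrete, here is a minimal Python sketch of the critique-refinement cycle described in the abstract. Every callable (critic_fn, actor_refine_fn, score_fn, update_critic_fn) is a hypothetical placeholder standing in for the paper's models and supervision, not the authors' implementation.

```python
# Minimal sketch of the RCO critique-refinement loop; all interfaces are assumptions.

def critique_utility(prompt, initial_response, refined_responses, score_fn):
    """Critique Utility (CU): average quality gain of refinements over the initial response.

    score_fn is a hypothetical response-quality scorer; the paper derives CU from
    preferences over refined responses rather than from an explicit scorer.
    """
    baseline = score_fn(prompt, initial_response)
    gains = [score_fn(prompt, refined) - baseline for refined in refined_responses]
    return sum(gains) / len(gains)


def rco_step(prompt, initial_response, critic_fn, actor_refine_fn, score_fn,
             update_critic_fn, n_refinements=4):
    # 1. The critic model writes a critique of the actor's initial response.
    critique = critic_fn(prompt, initial_response)
    # 2. The actor model refines its response, guided by the critique.
    refinements = [actor_refine_fn(prompt, initial_response, critique)
                   for _ in range(n_refinements)]
    # 3. CU turns refinement quality into the reward for the critique itself.
    reward = critique_utility(prompt, initial_response, refinements, score_fn)
    # 4. The critic is updated to prefer critiques that drive better refinements,
    #    removing the need for direct preference judgments on critiques.
    update_critic_fn(prompt, initial_response, critique, reward)
    return reward
```

The key design choice this sketch illustrates is that a critique is never scored on its own wording: its reward comes entirely from how much the downstream refinement improves.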
Problem

Research questions and friction points this paper is trying to address.

Optimizing critique types for better model response refinement
Effectively training critic models with refinement signals
Enhancing critique quality across multiple LLM tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refinement-oriented Critique Optimization framework
Feedback loop with critique utility reward
Training critic models via refinement signals
🔎 Similar Papers
No similar papers found.
Tianshu Yu
The Chinese University of Hong Kong, Shenzhen
Machine Learning, Optimization, AI4Science
Chao Xiang
University of Hong Kong
silicon photonics, semiconductor lasers, photonic integrated circuits
Mingchuan Yang
China Telecom Research Institute
Pei Ke
Associate Professor, University of Electronic Science and Technology of China
Natural Language Processing, Natural Language Generation, Dialogue System, Large Language Model
Bosi Wen
Tsinghua University
Natural Language Processing
Cunxiang Wang
Tsinghua University; ZhipuAI
Large Language Models, LLM Evaluation, LLM Post-training
Jiale Cheng
The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University
Li Zhang
China Telecom Research Institute
Xinyu Mu
China Telecom Research Institute
Chuxiong Sun
China Telecom Research Institute
Minlie Huang
The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University