Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

📅 2024-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often exhibit decreased accuracy after attempting to self-correct their answers. This work formally decouples self-correction capability into two orthogonal dimensions: *confidence ability* (remaining confident in answers that are already correct) and *critique ability* (turning incorrect answers into correct ones). It proposes three probabilistically grounded, quantifiable evaluation metrics: one for each capability and a third for overall self-correction. Through systematic enumeration of correction behaviors, prompt engineering, and in-context learning analysis, we uncover an intrinsic trade-off between these two abilities. To mitigate this tension, we introduce a structured supervised fine-tuning (SFT) data-format reconstruction strategy. Empirical validation across multiple LLMs demonstrates that our approach significantly enhances the synergy between confidence and critique abilities, yielding substantial improvements in post-correction accuracy over baseline methods.

📝 Abstract
Large Language Models (LLMs) can correct their self-generated responses, but a decline in accuracy after self-correction is also widely observed. To understand self-correction more deeply, we decompose, evaluate, and analyze the self-correction behaviors of LLMs. By enumerating and analyzing answer correctness before and after self-correction, we decompose self-correction capability into confidence (staying confident in correct answers) and critique (turning wrong answers into correct ones), and propose two metrics from a probabilistic perspective to measure these two capabilities, along with a third metric for evaluating overall self-correction capability. Based on this decomposition and these metrics, we conduct extensive experiments and draw several empirical conclusions. For example, we find that different models exhibit distinct behaviors: some are confident while others are more critical. We also find a trade-off between the two capabilities (i.e., improving one can degrade the other) when manipulating model self-correction behavior through prompts or in-context learning. Further, we find a simple yet effective strategy for improving self-correction capability by transforming the Supervised Fine-Tuning (SFT) data format; our strategy outperforms vanilla SFT on both capabilities and achieves much higher accuracy after self-correction. Our code will be publicly available on GitHub.
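The decomposition above admits a direct empirical estimate once each question is labeled for correctness before and after self-correction. The Python sketch below is one plausible reading of the abstract's description, not the paper's released code; the `Record` type and the `confidence`, `critique`, and `accuracy_after` names are illustrative, and the paper's exact probabilistic definitions may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    """Correctness of one answer before and after self-correction."""
    correct_before: bool
    correct_after: bool

def confidence(records: List[Record]) -> float:
    """Estimate P(correct after | correct before): keeping right answers right."""
    right = [r for r in records if r.correct_before]
    return sum(r.correct_after for r in right) / len(right)

def critique(records: List[Record]) -> float:
    """Estimate P(correct after | wrong before): turning wrong answers right."""
    wrong = [r for r in records if not r.correct_before]
    return sum(r.correct_after for r in wrong) / len(wrong)

def accuracy_after(records: List[Record]) -> float:
    """Overall accuracy after self-correction."""
    return sum(r.correct_after for r in records) / len(records)

if __name__ == "__main__":
    # Toy data: one answer kept correct, one broken, one fixed, one still wrong.
    recs = [Record(True, True), Record(True, False),
            Record(False, True), Record(False, False)]
    print(confidence(recs), critique(recs), accuracy_after(recs))  # 0.5 0.5 0.5
```

Under this reading, post-correction accuracy satisfies acc_after = p * confidence + (1 - p) * critique, where p is the initial accuracy. This identity makes the reported trade-off concrete: prompting a model to revise more aggressively can raise critique while lowering confidence, so the net effect on accuracy depends on p.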
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Self-correction
Accuracy Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Error Correction
Large Language Model
Training Data Format Modification
Zhe Yang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yichang Zhang
Qwen Team, Alibaba Group
NLP, Reinforcement Learning, Deep Learning, Machine Learning, Artificial Intelligence
Yudong Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ziyao Xu
Peking University
Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language Processing, Cross-Modal Representation Learning, Pretraining
Zhifang Sui
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University