🤖 AI Summary
This study addresses the faithfulness problem in chain-of-thought (CoT) reasoning by large language models (LLMs): the frequent misalignment between generated reasoning steps and the model's actual decision process. We conduct the first systematic comparison of Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) for improving CoT faithfulness, empirically evaluating both methods across the Qwen2.5 model family (0.5B–14B). Results show that GRPO consistently outperforms DPO, achieving state-of-the-art faithfulness on larger models (e.g., Qwen2.5-14B-Instruct), and that faithfulness gains scale positively with model size, suggesting untapped potential for GRPO in large-scale settings; both methods, however, exhibit instability on smaller models. This work establishes a new empirical benchmark for CoT faithfulness optimization and offers methodological insights into alignment-aware preference optimization for reasoning transparency.
📝 Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly on tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process: models may produce coherent yet misleading justifications, or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), on their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO on larger models, with Qwen2.5-14B-Instruct attaining the best results across all evaluation metrics. Both approaches exhibit a positive correlation between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.