🤖 AI Summary
This study addresses the faithfulness problem in chain-of-thought (CoT) reasoning by large language models (LLMs): the frequent misalignment between generated reasoning steps and the model's actual decision process. We conduct the first systematic comparison of Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) for improving CoT faithfulness, empirically evaluating both methods across the Qwen2.5 model family (0.5B–14B). Results show that GRPO consistently outperforms DPO, achieving state-of-the-art faithfulness on larger models (e.g., Qwen2.5-14B-Instruct), and that faithfulness gains scale positively with model size, suggesting untapped potential for GRPO in large-scale settings; both methods, however, exhibit instability on smaller models. This work establishes a new empirical benchmark for CoT faithfulness optimization and offers methodological insights into alignment-aware preference optimization for reasoning transparency.
📝 Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly on tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process: models may produce coherent yet misleading justifications, or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), on their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO on larger models, with Qwen2.5-14B-Instruct attaining the best results across all evaluation metrics. Both approaches exhibit a positive correlation between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.