🤖 AI Summary
Existing LLM reasoning fine-tuning faces two key challenges: (1) reinforcement learning (RL)-based methods suffer from training instability and performance degradation because they neglect annotated chain-of-thought (CoT) supervision and are subject to high sampling variance; (2) supervised fine-tuning (SFT) over-relies on scarce annotated CoTs, limiting discovery of latent effective reasoning paths. To address these, we propose CARFT, a contrastive learning framework that unifies annotated CoT supervision with RL-style optimization. CARFT introduces contrastive signals grounded in semantic similarity among CoT representations, jointly optimizing supervised and unsupervised objectives, and incorporates explicit CoT representation learning and path-consistency regularization to enhance training stability and generalization. Extensive experiments demonstrate that CARFT achieves up to a 10.15% absolute accuracy gain over state-of-the-art methods across multiple reasoning benchmarks, improves training efficiency by up to 30.62%, and significantly enhances robustness.
📝 Abstract
Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and rely on unstable reasoning-path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoTs. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoTs but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analyses with three baseline approaches, two foundation models, and two datasets to demonstrate the significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
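For intuition, below is a minimal, hypothetical sketch of the kind of joint objective the abstract describes: an SFT loss on annotated CoTs combined with an InfoNCE-style contrastive signal computed over CoT embeddings. The paper's code is at the repository above; the pooling scheme, the pairing of sampled versus annotated CoTs, the temperature, and the `alpha` weighting here are all illustrative assumptions, not CARFT's actual implementation.

```python
# Hypothetical sketch of a CARFT-style objective (not the authors' code):
# an InfoNCE-style contrastive loss over CoT embeddings added to an SFT loss.
import torch
import torch.nn.functional as F

def cot_embedding(hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token hidden states (B, T, D) into one embedding per CoT.
    Mean pooling is an assumption; the paper may use a different encoder."""
    mask = mask.unsqueeze(-1).float()                     # (B, T, 1)
    return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)

def contrastive_signal(sampled: torch.Tensor, annotated: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """Pull each sampled CoT toward its annotated CoT (positive pair) and
    away from other annotated CoTs in the batch (in-batch negatives)."""
    sampled = F.normalize(sampled, dim=-1)
    annotated = F.normalize(annotated, dim=-1)
    logits = sampled @ annotated.T / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def total_loss(sft_loss: torch.Tensor, sampled_emb: torch.Tensor,
               annotated_emb: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Joint objective: supervised CoT loss plus the unsupervised contrastive
    term. The alpha weighting is a placeholder, not a reported hyperparameter."""
    return sft_loss + alpha * contrastive_signal(sampled_emb, annotated_emb)
```

The intended effect, per the abstract, is that the contrastive term acts as an additional unsupervised signal that stabilizes fine-tuning while the supervised term keeps the annotated CoTs fully exploited.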