AI Summary
This work investigates how supervised fine-tuning (SFT) hierarchically enhances large language models' mathematical reasoning capabilities. Method: Leveraging the AIME24 dataset, we propose the first four-tier difficulty ladder (Easy → Medium → Hard → Extremely Hard), integrating fine-grained error analysis, step-level chain-of-thought (CoT) accuracy tracking, systematic SFT scale/quality comparisons, and difficulty-aware clustering-based attribution. Contribution/Results: We identify distinct capability transition mechanisms: (i) Medium-tier proficiency emerges robustly with only 500-1K high-quality examples; (ii) Hard-tier CoT step accuracy saturates at ~65%, exposing persistent intermediate reasoning bottlenecks; (iii) Extremely Hard problems reveal fundamental deficits in non-routine reasoning. Crucially, scaling SFT data volume yields significantly greater gains than curating smaller, highly selective datasets. The study establishes a verifiable, difficulty-graded evaluation framework for mathematical reasoning and provides an empirically grounded, tiered training roadmap.
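To make the difficulty ladder concrete, below is a minimal sketch of how problems could be bucketed into the four tiers from empirical solve rates over repeated model attempts. The thresholds, function names, and example problem IDs are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch: bucket AIME24 problems into the four difficulty tiers
# by per-problem solve rate across sampled attempts. The cutoff values below
# are assumptions for illustration, not the paper's actual criteria.
from collections import defaultdict


def assign_tier(solve_rate: float) -> str:
    """Map a problem's empirical solve rate to a difficulty tier (assumed cutoffs)."""
    if solve_rate >= 0.8:
        return "Easy"
    if solve_rate >= 0.5:
        return "Medium"
    if solve_rate > 0.0:
        return "Hard"
    return "Extremely Hard (Exh)"


def tier_problems(results: dict[str, list[bool]]) -> dict[str, list[str]]:
    """results maps problem_id -> pass/fail outcomes over sampled attempts."""
    tiers: dict[str, list[str]] = defaultdict(list)
    for problem_id, outcomes in results.items():
        rate = sum(outcomes) / len(outcomes) if outcomes else 0.0
        tiers[assign_tier(rate)].append(problem_id)
    return dict(tiers)


# Example: three problems, each attempted four times (IDs are hypothetical).
print(tier_problems({
    "AIME24-01": [True, True, True, True],      # solved every time -> Easy
    "AIME24-07": [True, False, True, False],    # solved half the time -> Medium
    "AIME24-15": [False, False, False, False],  # never solved -> Exh
}))
```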
Abstract
Recent supervised fine-tuning (SFT) approaches have significantly improved language models' performance on mathematical reasoning tasks, even when models are trained at a small scale. However, the specific capabilities enhanced through such fine-tuning remain poorly understood. In this paper, we conduct a detailed analysis of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. We discover a ladder-like structure in problem difficulty, categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)), and identify the specific requirements for advancing between tiers. We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances), while Hard-level questions suffer from frequent model errors at each step of the reasoning chain, with accuracy plateauing at around 65% despite logarithmic scaling. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills that current models uniformly struggle with. Additional findings reveal that carefully curated small-scale datasets offer limited advantage; scaling dataset size proves far more effective. Our analysis provides a clearer roadmap for advancing language model capabilities in mathematical reasoning.
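As a companion to the abstract's claim about Hard-tier accuracy plateauing, here is a minimal sketch of what step-level CoT accuracy tracking could look like: each solution's intermediate steps are labeled correct or incorrect by some external verifier, and per-step accuracy is averaged within each tier. The data layout, function names, and example values are assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): step-level CoT accuracy
# aggregated per difficulty tier. `graded` is an assumed structure mapping
# problem_id -> (tier, per-step correctness labels from an external verifier).


def step_accuracy(step_labels: list[bool]) -> float:
    """Fraction of reasoning steps judged correct for one solution."""
    return sum(step_labels) / len(step_labels) if step_labels else 0.0


def tier_step_accuracy(graded: dict[str, tuple[str, list[bool]]]) -> dict[str, float]:
    """Average step-level accuracy of graded solutions within each tier."""
    per_tier: dict[str, list[float]] = {}
    for tier, labels in graded.values():
        per_tier.setdefault(tier, []).append(step_accuracy(labels))
    return {tier: sum(scores) / len(scores) for tier, scores in per_tier.items()}


# Example with hypothetical problem IDs and step labels.
print(tier_step_accuracy({
    "AIME24-07": ("Medium", [True, True, True, False]),
    "AIME24-11": ("Hard",   [True, False, True, False, False]),
}))
```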