KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

2026-06-09
Citations: 0
Influential: 0
AI Summary
This work addresses the lack of human-grounded difficulty signals in existing mathematical reasoning benchmarks, a gap that makes it hard to evaluate whether model errors align with human difficulty. The authors introduce KCSAT-ML, a benchmark of 664 Korean college entrance exam (KCSAT) mathematics problems, 339 of which are annotated with official error rates derived from hundreds of thousands of test-takers, presented here as the first integration of real human response data into AI reasoning evaluation. They propose Difficulty-aligned Reasoning Gain (DRG), a metric designed to be orthogonal to accuracy that shows whether model errors fall on the items humans find hard or on the items humans find easy. Experiments show that models run with a low reasoning budget degrade sharply on questions with high human error rates; test-time scaling shows inverse scaling on the hardest items and overthinking on easier ones; and models with comparable accuracy can differ markedly in their error behavior under DRG.
Abstract
Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.
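The abstract does not give the formula for DRG, so the following is only an illustrative proxy, not the paper's metric. It is a minimal sketch that assumes per-item human error rates and per-item model correctness are available, and it uses the Pearson correlation between the human error rate and the model's 0/1 error indicator as a stand-in for "difficulty alignment". The function names and the toy data are hypothetical.

```python
from statistics import mean


def accuracy(model_correct):
    """Fraction of items the model answered correctly."""
    return sum(model_correct) / len(model_correct)


def alignment_score(human_error_rates, model_correct):
    """Illustrative proxy only (NOT the paper's DRG definition).

    Pearson correlation between each item's human error rate and the
    model's 0/1 error indicator. Positive values mean the model's errors
    concentrate on items humans found hard; negative values mean they
    concentrate on items humans found easy.
    """
    if len(human_error_rates) != len(model_correct):
        raise ValueError("inputs must have the same length")
    wrong = [0.0 if c else 1.0 for c in model_correct]
    mh, mw = mean(human_error_rates), mean(wrong)
    cov = sum((h - mh) * (w - mw) for h, w in zip(human_error_rates, wrong))
    var_h = sum((h - mh) ** 2 for h in human_error_rates)
    var_w = sum((w - mw) ** 2 for w in wrong)
    if var_h == 0 or var_w == 0:
        return 0.0  # correlation undefined; 0.0 is a convention chosen here
    return cov / (var_h * var_w) ** 0.5


# Toy data (made up for illustration): six items, easiest to hardest.
human = [0.05, 0.10, 0.30, 0.60, 0.85, 0.95]
model_a = [True, True, True, True, False, False]  # misses the hardest items
model_b = [False, False, True, True, True, True]  # misses the easiest items

if __name__ == "__main__":
    for name, correct in (("A", model_a), ("B", model_b)):
        print(name, accuracy(correct), alignment_score(human, correct))
```

In this toy setup both models have the same accuracy, but the proxy score is positive for model A and negative for model B, which mirrors the contrast the abstract says aggregate accuracy hides.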
Problem

Research questions and friction points this paper is trying to address.

math reasoning
human difficulty
benchmark
reasoning models
error rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

KCSAT-ML
Difficulty-aligned Reasoning Gain
human difficulty calibration
test-time scaling
reasoning benchmark