Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting

📅 2025-11-23
🤖 AI Summary
Prior LLM evaluations on Korean College Scholastic Ability Test (CSAT) mathematics suffer from data leakage and lack real-world multimodal and bilingual assessment. Method: we construct the first zero-data-leakage benchmark for the 2026 Korean CSAT mathematics section, rigorously isolating training corpora from test items to ensure fully out-of-distribution evaluation; we administer real-time digital assessments across text, image, and mixed-modal inputs and Korean–English bilingual prompts, under a practical framework balancing accuracy, inference cost, and latency. Contribution/Results: we establish a novel, exam-grounded paradigm for evaluating LLM mathematical reasoning. GPT-5 Codex achieves the sole perfect score; multiple models exceed 95/100; gpt-oss-20B delivers the best cost–accuracy trade-off. Crucially, geometry emerges as a consistent weakness across models, and while advanced reasoning techniques improve accuracy, they incur substantial computational overhead.

📝 Abstract
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and DeepSeek-R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average), with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with the GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (https://isoft.cnu.ac.kr/csat2026/).
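The abstract's "practical evaluation perspective integrating performance, cost, and time" can be sketched as a simple cost-adjusted metric. The sketch below is illustrative only: the metric (exam points per 1,000 tokens) and all token/latency numbers are assumptions, not figures from the paper; only the headline scores of 100 and 95.7 points appear in the abstract.

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    model: str
    score: float        # exam points, 0-100
    total_tokens: int   # prompt + completion tokens (hypothetical values below)
    latency_s: float    # wall-clock seconds for the full exam (hypothetical)

def points_per_kilotoken(run: EvalRun) -> float:
    """Cost-effectiveness proxy: exam points earned per 1,000 tokens consumed."""
    return run.score / (run.total_tokens / 1000)

# Illustrative runs; token counts and latencies are invented for the example.
runs = [
    EvalRun("GPT-5 Codex", 100.0, 400_000, 1800.0),
    EvalRun("gpt-oss-20B", 95.7, 120_000, 600.0),
]

best = max(runs, key=points_per_kilotoken)
print(best.model)
```

Under such a metric, a smaller model with a slightly lower score can rank first once token consumption is factored in, which is the trade-off the abstract highlights for gpt-oss-20B.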
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on unseen 2026 Korean CSAT math exam to prevent data leakage
Assessing mathematical reasoning across text, image, and multimodal input modalities
Analyzing performance-cost tradeoffs in reasoning enhancement techniques for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Digitized exam questions immediately after public release
Evaluated 24 LLMs across multiple input modalities
Assessed reasoning enhancement effects on performance efficiency
Goun Pyeon
Department of Computer Science & Engineering, Chungnam National University
Inbum Heo
Department of Computer Science & Engineering, Chungnam National University
Jeesu Jung
Chungnam National University
Taewook Hwang
Department of Computer Science & Engineering, Chungnam National University
Hyuk Namgoong
Department of Computer Science & Engineering, Chungnam National University
Hyein Seo
Department of Computer Science & Engineering, Chungnam National University
Yerim Han
Department of Computer Science & Engineering, Chungnam National University
Eunbin Kim
Department of Computer Science & Engineering, Chungnam National University
Hyeonseok Kang
Department of Computer Science & Engineering, Chungnam National University
Sangkeun Jung
Chungnam National University
artificial intelligence · natural language processing · machine learning