Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting

📅 2025-11-23
🤖 AI Summary
Prior LLM evaluations on Korean College Scholastic Ability Test (CSAT) mathematics suffer from data leakage and lack real-world multimodal and bilingual assessment. Method: we construct the first zero-data-leakage benchmark for the 2026 Korean CSAT mathematics section, rigorously isolating training corpora from test items to ensure fully out-of-distribution evaluation; we administer real-time digital assessments across text, image, and mixed-modal inputs and Korean–English bilingual prompts, under a practical framework balancing accuracy, inference cost, and latency. Contribution/Results: we establish a novel, exam-grounded paradigm for evaluating LLM mathematical reasoning. GPT-5 Codex achieves the sole perfect score; multiple models exceed 95/100; gpt-oss-20B delivers the best cost–accuracy trade-off. Crucially, geometry emerges as a consistent weakness across models, and while advanced reasoning techniques improve accuracy, they incur substantial computational overhead.

📝 Abstract
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and DeepSeek-R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average), with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with the GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (https://isoft.cnu.ac.kr/csat2026/).
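The abstract's "practical evaluation perspective integrating performance, cost, and time" can be sketched as a simple cost-adjusted metric. The sketch below is illustrative only: the metric (exam points per 1,000 tokens) and all token/latency numbers are assumptions, not figures from the paper; only the headline scores of 100 and 95.7 points appear in the abstract.

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    model: str
    score: float        # exam points, 0-100
    total_tokens: int   # prompt + completion tokens (hypothetical values below)
    latency_s: float    # wall-clock seconds for the full exam (hypothetical)

def points_per_kilotoken(run: EvalRun) -> float:
    """Cost-effectiveness proxy: exam points earned per 1,000 tokens consumed."""
    return run.score / (run.total_tokens / 1000)

# Illustrative runs; token counts and latencies are invented for the example.
runs = [
    EvalRun("GPT-5 Codex", 100.0, 400_000, 1800.0),
    EvalRun("gpt-oss-20B", 95.7, 120_000, 600.0),
]

best = max(runs, key=points_per_kilotoken)
print(best.model)
```

Under such a metric, a smaller model with a slightly lower score can rank first once token consumption is factored in, which is the trade-off the abstract highlights for gpt-oss-20B.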
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on unseen 2026 Korean CSAT math exam to prevent data leakage
Assessing mathematical reasoning across text, image, and multimodal input modalities
Analyzing performance-cost tradeoffs in reasoning enhancement techniques for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Digitized exam questions immediately after public release
Evaluated 24 LLMs across multiple input modalities
Assessed reasoning enhancement effects on performance efficiency
Goun Pyeon
Department of Computer Science & Engineering, Chungnam National University
Inbum Heo
Department of Computer Science & Engineering, Chungnam National University
Jeesu Jung
Chungnam National University
Taewook Hwang
Department of Computer Science & Engineering, Chungnam National University
Hyuk Namgoong
Department of Computer Science & Engineering, Chungnam National University
Hyein Seo
Department of Computer Science & Engineering, Chungnam National University
Yerim Han
Department of Computer Science & Engineering, Chungnam National University
Eunbin Kim
Department of Computer Science & Engineering, Chungnam National University
Hyeonseok Kang
Department of Computer Science & Engineering, Chungnam National University
Sangkeun Jung
Chungnam National University
artificial intelligence · natural language processing · machine learning