🤖 AI Summary
Existing benchmarks inadequately assess large language models’ (LLMs) domain-specific knowledge and higher-order reasoning in complex academic tasks. To address this, we introduce ScholarBench, a bilingual (English–Korean) academic benchmark covering eight research domains and five categories of complex problems requiring abstract comprehension and multi-step reasoning. Examples are generated according to the characteristic research methodologies and discourse structures of each domain. The evaluation framework integrates domain depth, logical complexity, and bilingual alignment, enabling systematic assessment of LLMs’ cross-disciplinary abstraction and multi-step reasoning within a unified benchmark. Annotation combines domain-customized prompting, bilingual parallel construction, and multidimensional capability decomposition. ScholarBench comprises 10,340 high-quality instances (5,309 English + 5,031 Korean). Even the state-of-the-art model o3-mini achieves a mean score of only 0.543, confirming the benchmark’s high difficulty and strong discriminative power.
📝 Abstract
Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce `ScholarBench`, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. `ScholarBench` targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, `ScholarBench` evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, the benchmark is an English-Korean bilingual dataset, facilitating simultaneous evaluation of the linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English; even state-of-the-art models such as o3-mini achieve an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.
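As a rough, self-contained illustration of the headline statistics, the sketch below models one benchmark instance and computes an unweighted mean score over the pooled bilingual splits. The `Example` record and its field names (`domain`, `problem_type`, `language`, `score`) are hypothetical stand-ins for illustration, not the released ScholarBench schema or the paper's official scorer.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Example:
    """Hypothetical layout of one benchmark instance (illustrative only)."""
    domain: str        # one of the eight research domains
    problem_type: str  # one of the five problem categories
    language: str      # "en" or "ko"
    score: float       # per-example evaluation score in [0, 1]

def average_score(examples: list[Example]) -> float:
    """Unweighted mean over all instances, pooling both language splits."""
    return mean(ex.score for ex in examples)

# Toy usage; the real splits would hold 5,309 English and 5,031 Korean
# instances (10,340 total), and o3-mini's pooled mean is reported as 0.543.
toy = [
    Example("physics", "short_answer", "en", 0.60),
    Example("physics", "short_answer", "ko", 0.50),
]
print(f"mean score: {average_score(toy):.3f}")  # -> mean score: 0.550
```

An unweighted pooled mean is the simplest reading of "average evaluation score"; per-language or per-domain averages would be computed the same way over the corresponding subsets.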