🤖 AI Summary
Existing benchmarks inadequately assess large language models’ (LLMs) domain-specific knowledge and higher-order reasoning in complex academic tasks. To address this, we introduce ScholarBench, a bilingual (English–Korean) academic benchmark covering eight research domains and five categories of complex problems requiring abstract comprehension and multi-step reasoning. Examples are generated according to the characteristic research methodologies and discourse structures of each domain. The evaluation framework integrates domain depth, logical complexity, and bilingual alignment, enabling systematic assessment of LLMs’ cross-disciplinary abstraction and multi-step reasoning within a unified benchmark. Annotation combines domain-customized prompting, bilingual parallel construction, and multidimensional capability decomposition. ScholarBench comprises 10,340 high-quality instances (5,309 English + 5,031 Korean). Even the state-of-the-art model o3-mini achieves a mean score of only 0.543, confirming the benchmark’s high difficulty and strong discriminative power.
📝 Abstract
Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce `ScholarBench`, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. `ScholarBench` targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, `ScholarBench` evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, the benchmark is an English-Korean bilingual dataset, facilitating simultaneous evaluation of the linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English; even state-of-the-art models such as o3-mini achieve an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.
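As a rough, self-contained illustration of the headline statistics, the sketch below models one benchmark instance and computes an unweighted mean score over the pooled bilingual splits. The `Example` record and its field names (`domain`, `problem_type`, `language`, `score`) are hypothetical stand-ins for illustration, not the released ScholarBench schema or the paper's official scorer.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Example:
    """Hypothetical layout of one benchmark instance (illustrative only)."""
    domain: str        # one of the eight research domains
    problem_type: str  # one of the five problem categories
    language: str      # "en" or "ko"
    score: float       # per-example evaluation score in [0, 1]

def average_score(examples: list[Example]) -> float:
    """Unweighted mean over all instances, pooling both language splits."""
    return mean(ex.score for ex in examples)

# Toy usage; the real splits would hold 5,309 English and 5,031 Korean
# instances (10,340 total), and o3-mini's pooled mean is reported as 0.543.
toy = [
    Example("physics", "short_answer", "en", 0.60),
    Example("physics", "short_answer", "ko", 0.50),
]
print(f"mean score: {average_score(toy):.3f}")  # -> mean score: 0.550
```

An unweighted pooled mean is the simplest reading of "average evaluation score"; per-language or per-domain averages would be computed the same way over the corresponding subsets.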