🤖 AI Summary
The absence of a unified, reproducible, and standardized evaluation framework for Korean large language models (LLMs) hinders the comparability and verification of results. Method: This paper introduces the first open-source, self-evolving evaluation toolkit designed specifically for Korean LLMs. It features a modular, registry-based architecture and an automated continuous-evolution mechanism, and it supports diverse evaluation paradigms, including log-probability scoring, exact match, language-inconsistency penalization, and LLM-as-a-Judge. The toolkit is compatible with vLLM, Hugging Face, and OpenAI-compatible APIs, integrates major Korean benchmarks (e.g., HAE-RAE Bench and KMMLU), and enables cross-backend collaborative evaluation. Contribution/Results: This work fills a critical gap in standardized Korean LLM evaluation, substantially improving fairness, transparency, and reproducibility, and it has already supported multiple state-of-the-art Korean LLM development efforts.
📝 Abstract
Recent advancements in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, yet the lack of a standardized evaluation framework has led to inconsistent results and limited comparability. To address this, we introduce HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework tailored specifically for Korean LLMs. HRET unifies diverse evaluation methods, including logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments. Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, Hugging Face, OpenAI-compatible endpoints). With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.
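To make the "modular, registry-based architecture" concrete, here is a minimal sketch of the general registry pattern such a toolkit might use: evaluation methods register themselves by name, and the pipeline looks them up at run time. All names below (`EVALUATORS`, `register`, `evaluate`) are hypothetical and do not reflect HRET's actual API.

```python
# Illustrative registry pattern; not HRET's real interface.
from typing import Callable, Dict

# Global name -> scoring-function registry.
EVALUATORS: Dict[str, Callable[[str, str], float]] = {}

def register(name: str):
    """Decorator that adds a scoring function to the registry under `name`."""
    def wrap(fn: Callable[[str, str], float]) -> Callable[[str, str], float]:
        EVALUATORS[name] = fn
        return fn
    return wrap

@register("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the whitespace-normalized strings are identical, else 0.0.
    return float(prediction.strip() == reference.strip())

def evaluate(method: str, prediction: str, reference: str) -> float:
    # Dispatch to the requested scoring method by registry key.
    return EVALUATORS[method](prediction, reference)
```

New scorers (logit-based, LLM-as-a-Judge, etc.) plug in with another `@register(...)` line, which is what makes this style of architecture easy to extend without touching the core pipeline.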