HRET: A Self-Evolving LLM Evaluation Toolkit for Korean

📅 2025-03-29
📈 Citations: 0 (influential: 0)
🤖 AI Summary
The absence of a unified, reproducible, and standardized evaluation framework for Korean large language models (LLMs) hinders the comparability and verification of results. Method: This paper introduces HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation toolkit designed specifically for Korean LLMs. It features a modular, registry-based architecture and automated pipelines for continuous evolution, and it supports diverse evaluation paradigms, including logit-based scoring, exact match, language-inconsistency penalization, and LLM-as-a-Judge assessment. The toolkit is compatible with vLLM, HuggingFace, and OpenAI-compatible APIs; integrates major Korean benchmarks such as HAE-RAE Bench, KMMLU, KUDGE, and HRM8K; and enables evaluation across multiple inference backends. Contribution/Results: This work fills a critical gap in standardized Korean LLM evaluation, improving fairness, transparency, and reproducibility, and it has already supported multiple state-of-the-art Korean LLM development efforts.
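The registry-based design mentioned above is a common plugin pattern: evaluation methods register themselves under a string key, so new scorers plug in without touching the core pipeline. The following is a minimal sketch of that pattern, not HRET's actual API; the names `EVALUATOR_REGISTRY` and `register_evaluator` are illustrative assumptions.

```python
# Hypothetical sketch of a registry-based evaluator architecture.
# All names are illustrative, not taken from HRET's codebase.
from typing import Callable, Dict

EVALUATOR_REGISTRY: Dict[str, Callable[[str, str], float]] = {}

def register_evaluator(name: str):
    """Decorator that adds a scoring function to the global registry."""
    def decorator(fn: Callable[[str, str], float]):
        EVALUATOR_REGISTRY[name] = fn
        return fn
    return decorator

@register_evaluator("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(prediction.strip() == reference.strip())

def evaluate(method: str, prediction: str, reference: str) -> float:
    # Look up the scorer by name, as a registry-driven pipeline would.
    return EVALUATOR_REGISTRY[method](prediction, reference)

print(evaluate("exact_match", "서울", "서울"))  # 1.0
```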

📝 Abstract
Recent advancements in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, yet the lack of a standardized evaluation framework has led to inconsistent results and limited comparability. To address this, we introduce HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework tailored specifically for Korean LLMs. HRET unifies diverse evaluation methods, including logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments. Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints). With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.
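The abstract lists exact-match scoring and language-inconsistency penalization among HRET's methods. As an illustration of the latter idea only (not HRET's actual implementation), a penalty could down-weight answers that drift out of Korean, for example by measuring the share of Latin versus Hangul characters:

```python
# Illustrative sketch (not HRET's implementation) of a language-inconsistency
# penalty: when a Korean answer is expected, scale the base score down in
# proportion to the share of non-Korean alphabetic characters in the output.
import re

HANGUL = re.compile(r"[\uac00-\ud7a3]")  # precomposed Hangul syllables
LATIN = re.compile(r"[A-Za-z]")

def language_penalty(text: str, weight: float = 1.0) -> float:
    """Return a multiplier in [0, 1]; 1.0 means fully Korean output."""
    hangul = len(HANGUL.findall(text))
    latin = len(LATIN.findall(text))
    total = hangul + latin
    if total == 0:
        return 1.0  # no alphabetic content to judge
    return max(0.0, 1.0 - weight * latin / total)

base_score = 0.9  # e.g., from exact match or a judge model
print(base_score * language_penalty("정답은 Seoul 입니다"))  # penalized
```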
Problem

Research questions and friction points this paper is trying to address.

Lack of a standardized evaluation framework for Korean LLMs
Inconsistent results and limited comparability across benchmarks
Need for reproducible and transparent Korean NLP research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified support for diverse Korean LLM evaluation methods
Modular, registry-based architecture integrating benchmarks and inference backends (see the cross-backend sketch below)
Automated pipelines for continuous evolution
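One reason the backend list above makes results comparable is that vLLM's built-in server (and many hosted services) expose OpenAI-compatible endpoints, so a single client can drive all of them. A minimal sketch of that pattern follows; the URL, model names, and `ask` helper are placeholders, not HRET configuration.

```python
# Sketch of cross-backend querying through OpenAI-compatible endpoints.
# The base_url and api_key below assume a locally served vLLM model.
from openai import OpenAI

BACKENDS = {
    # vLLM's server speaks the OpenAI chat API.
    "vllm": OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY"),
    # The hosted OpenAI API uses the same client with default settings.
    "openai": OpenAI(),
}

def ask(backend: str, model: str, question: str) -> str:
    """Send one question to the chosen backend and return the reply text."""
    client = BACKENDS[backend]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```

The same evaluation loop can then score answers from any backend with any registered metric, which is what allows results from different serving stacks to be compared on equal footing.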