🤖 AI Summary
Existing Finnish large language models lack a unified, robust evaluation benchmark.
Method: We introduce the first open-source Finnish evaluation suite covering five core tasks—reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment—supporting both generative and multiple-choice assessment. We propose a novel automated task-selection procedure based on four learning-curve criteria (monotonicity, signal-to-noise ratio, non-randomness, and model-order consistency) and integrate human-verified machine-translated resources to ensure cultural appropriateness and reliability. The suite adopts the Hugging Face Datasets format, includes five prompt variants per task, and is validated through learning-curve analysis of 2.15B-parameter pre-trained models and cross-prompt benchmarking of larger instruction-tuned models.
Contribution/Results: Our suite significantly improves consistency, reproducibility, and task discriminability in Finnish LLM evaluation, demonstrating robustness across multiple state-of-the-art models.
📝 Abstract
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to the Hugging Face Datasets format and include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise ratio, non-random performance, and model-ordering consistency, retaining only tasks that satisfy all four criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
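The abstract names four learning-curve criteria for task selection but not their formulas. The following is a minimal NumPy sketch of how such criteria could be scored over per-checkpoint accuracy curves; the exact definitions, thresholds, and function names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def monotonicity(curve):
    """Fraction of checkpoint-to-checkpoint steps that improve the score
    (assumed proxy: 1.0 means the curve rises at every checkpoint)."""
    diffs = np.diff(curve)
    return float((diffs > 0).mean())

def signal_to_noise(curve):
    """Overall improvement relative to step-to-step fluctuation
    (assumed proxy for the paper's signal-to-noise ratio)."""
    diffs = np.diff(curve)
    noise = diffs.std()
    return float(abs(curve[-1] - curve[0]) / noise) if noise > 0 else float("inf")

def non_random(curve, chance, margin=0.02):
    """Final score must clear the task's random baseline by a margin
    (margin of 0.02 is an arbitrary illustrative choice)."""
    return bool(curve[-1] > chance + margin)

def order_consistent(curves):
    """True if the relative ordering of models is identical at every
    checkpoint. `curves`: dict model_name -> equal-length score array."""
    names = sorted(curves)
    steps = len(curves[names[0]])
    orders = [tuple(np.argsort([curves[n][t] for n in names]))
              for t in range(steps)]
    return all(o == orders[0] for o in orders)
```

A task would then be retained only if all four checks pass, e.g. a noisy curve that hovers at the multiple-choice chance level fails `non_random` and is dropped regardless of the other scores.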