FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

📅 2025-12-15
🤖 AI Summary
Existing Finnish large language models lack a unified, robust evaluation benchmark. Method: We introduce a unified open-source Finnish evaluation suite covering five core tasks (reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment) and supporting both generative and multiple-choice assessment. We propose an automated task selection procedure based on four learning-curve criteria (monotonicity, signal-to-noise ratio, non-random performance, and model-order consistency) and integrate human-verified machine-translated resources to ensure cultural appropriateness and reliability. The suite adopts the Hugging Face Datasets format and includes five prompt variants per task; tasks are selected using the learning curves of pretrained 2.15B-parameter models and further validated via cross-prompt benchmarking of larger instruction-tuned models. Contribution/Results: The suite improves consistency, reproducibility, and task discriminability in Finnish LLM evaluation, demonstrating robustness across multiple state-of-the-art models.
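The four learning-curve criteria are described only at a high level here. Below is a minimal sketch of how such checks might be computed over per-checkpoint task scores; the specific statistics (Spearman and Kendall correlations) and all thresholds are illustrative assumptions, not the authors' published implementation.

```python
# Illustrative sketch only: the concrete formulas and thresholds are
# assumptions, not the criteria published with FIN-bench-v2.
import numpy as np
from scipy.stats import spearmanr, kendalltau

def monotonicity(curve):
    """Spearman correlation between training step and task score:
    values near 1.0 mean the learning curve rises steadily."""
    steps = np.arange(len(curve))
    return spearmanr(steps, curve).statistic

def signal_to_noise(curve):
    """Total improvement relative to step-to-step jitter."""
    signal = curve[-1] - curve[0]
    noise = np.std(np.diff(curve)) + 1e-9
    return signal / noise

def non_random(curve, chance_level):
    """Final score should clearly exceed the random baseline
    (e.g. 0.25 for 4-way multiple choice); margin is an assumption."""
    return curve[-1] > chance_level + 0.05

def order_consistency(curves_by_model):
    """Mean Kendall tau between model score vectors at consecutive
    checkpoints: high values mean models keep the same ranking."""
    checkpoints = np.array(curves_by_model).T  # shape (steps, models)
    taus = [kendalltau(checkpoints[i], checkpoints[i + 1]).statistic
            for i in range(len(checkpoints) - 1)]
    return float(np.mean(taus))

def keep_task(curves_by_model, chance_level):
    """Retain a task only if every criterion passes (thresholds assumed)."""
    mean_curve = np.mean(curves_by_model, axis=0)
    return (monotonicity(mean_curve) > 0.8
            and signal_to_noise(mean_curve) > 2.0
            and non_random(mean_curve, chance_level)
            and order_consistency(curves_by_model) > 0.5)
```

In this sketch, a task is retained only when its averaged curve rises steadily, improvement dominates checkpoint-to-checkpoint noise, the final score clearly beats chance, and model rankings stay stable across training.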

📝 Abstract
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to the Hugging Face Datasets format and include both cloze and multiple-choice prompt formulations, with five prompt variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise ratio, non-random performance, and model-ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
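Since the evaluation configurations ship via a fork of EleutherAI's Language Model Evaluation Harness, a run would presumably use the harness's standard Python API, roughly as sketched below. The task identifier and model choice are hypothetical placeholders; the fork's actual task names are not listed here.

```python
# Sketch of a harness run against the LumiOpen fork; install with
#   pip install git+https://github.com/LumiOpen/lm-evaluation-harness
# The task name below is a hypothetical placeholder, not a confirmed
# identifier from the fork, and the model choice is illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                 # Hugging Face backend
    model_args="pretrained=LumiOpen/Poro-34B",  # illustrative model
    tasks=["finbench_v2_xed"],                  # placeholder task name
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```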
Problem

Research questions and friction points this paper is trying to address.

How to evaluate Finnish large language models consistently across multiple tasks
How to consolidate and standardize fragmented Finnish benchmark datasets
How to select robust evaluation tasks using model learning-curve analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmark suite for Finnish language models in a consistent Hugging Face Datasets format (see the loading sketch after this list)
Human annotation for machine-translated evaluation resources
Robust task selection using pretrained model learning curves
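As a rough illustration of what the unified Hugging Face Datasets format implies for downstream users, the sketch below loads one task and iterates its prompt variants. The repository id, configuration name, and column names are all assumptions; the suite's concrete schema is documented in the repositories linked above.

```python
# Hypothetical example of consuming a FIN-bench-v2 task in Hugging Face
# Datasets format; the repo id, config name, and column names are
# assumptions, not the suite's confirmed schema.
from datasets import load_dataset

ds = load_dataset("TurkuNLP/FIN-bench-v2", "goldenswag", split="test")

example = ds[0]
# Each task ships five prompt variants; a plausible layout stores the
# formulations alongside the example fields.
for variant in range(5):
    prompt = example[f"prompt_{variant}"]  # hypothetical column naming
    print(prompt, "->", example["label"])
```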
👥 Authors
Joona Kytöniemi
TurkuNLP, Department of Computing, University of Turku, Finland
Jousia Piha
TurkuNLP, Department of Computing, University of Turku, Finland
Akseli Reunamo
TurkuNLP, Department of Computing, University of Turku, Finland
Fedor Vitiugin
TurkuNLP, Department of Computing, University of Turku, Finland
Farrokh Mehryary
University of Turku
Natural Language Processing, text mining, deep learning, bioinformatics
Sampo Pyysalo
University of Turku