🤖 AI Summary
Existing Finnish large language models lack a unified, robust evaluation benchmark.
Method: We introduce the first open-source Finnish evaluation suite covering five core tasks—reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment—supporting both generative and multiple-choice assessment. We propose a novel automated task-selection procedure based on four learning-curve criteria (monotonicity, signal-to-noise ratio, non-randomness, and model-order consistency) and integrate human-verified machine-translated resources to ensure cultural appropriateness and reliability. The suite adopts the Hugging Face Datasets format, includes five prompt variants per task, and is validated through learning-curve analysis of 2.15B-parameter pre-trained models and cross-prompt benchmarking of larger instruction-tuned models.
Contribution/Results: Our suite significantly improves consistency, reproducibility, and task discriminability in Finnish LLM evaluation, demonstrating robustness across multiple state-of-the-art models.
📝 Abstract
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to the Hugging Face Datasets format and include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise ratio, non-random performance, and model-ordering consistency, retaining only tasks that satisfy all four criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
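The abstract names four learning-curve criteria for task selection but not their formulas. The following is a minimal NumPy sketch of how such criteria could be scored over per-checkpoint accuracy curves; the exact definitions, thresholds, and function names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def monotonicity(curve):
    """Fraction of checkpoint-to-checkpoint steps that improve the score
    (assumed proxy: 1.0 means the curve rises at every checkpoint)."""
    diffs = np.diff(curve)
    return float((diffs > 0).mean())

def signal_to_noise(curve):
    """Overall improvement relative to step-to-step fluctuation
    (assumed proxy for the paper's signal-to-noise ratio)."""
    diffs = np.diff(curve)
    noise = diffs.std()
    return float(abs(curve[-1] - curve[0]) / noise) if noise > 0 else float("inf")

def non_random(curve, chance, margin=0.02):
    """Final score must clear the task's random baseline by a margin
    (margin of 0.02 is an arbitrary illustrative choice)."""
    return bool(curve[-1] > chance + margin)

def order_consistent(curves):
    """True if the relative ordering of models is identical at every
    checkpoint. `curves`: dict model_name -> equal-length score array."""
    names = sorted(curves)
    steps = len(curves[names[0]])
    orders = [tuple(np.argsort([curves[n][t] for n in names]))
              for t in range(steps)]
    return all(o == orders[0] for o in orders)
```

A task would then be retained only if all four checks pass, e.g. a noisy curve that hovers at the multiple-choice chance level fails `non_random` and is dropped regardless of the other scores.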