FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models
By: Joona Kytöniemi, Jousia Piha, Akseli Reunamo and more
Potential Business Impact:
Tests how well computers understand Finnish.
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to the HuggingFace Datasets format and include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute four criteria: monotonicity, signal-to-noise, non-random performance, and model ordering consistency. Only tasks that satisfy all four criteria are retained. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
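The four task-selection criteria named in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas or thresholds, so everything below is an assumption made for illustration: monotonicity is taken as the Spearman correlation between checkpoint index and score, signal-to-noise as the mean over the standard deviation of the last few checkpoints, non-random performance as a fixed margin above chance, and ordering consistency as pairwise agreement with the final model ranking. The function names, thresholds, and example scores are all hypothetical and are not taken from the paper or its repositories.

```python
from itertools import combinations
from statistics import mean, stdev


def _ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def monotonicity(scores):
    """Assumed definition: Spearman correlation of checkpoint index vs. score."""
    rx, ry = _ranks(list(range(len(scores)))), _ranks(scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant curve carries no trend
    return cov / (var_x * var_y) ** 0.5


def signal_to_noise(scores, tail=3):
    """Assumed definition: mean / standard deviation over the last `tail` checkpoints."""
    last = scores[-tail:]
    noise = stdev(last)
    return float("inf") if noise == 0 else mean(last) / noise


def non_random(scores, chance, margin=0.05):
    """Assumed definition: the final score beats chance by a fixed margin."""
    return scores[-1] > chance + margin


def ordering_consistency(curves):
    """Assumed definition: share of (model pair, checkpoint) cases that
    agree with the ordering of that pair at the final checkpoint."""
    agree = total = 0
    for a, b in combinations(curves, 2):
        final_order = curves[a][-1] > curves[b][-1]
        for t in range(len(curves[a]) - 1):
            agree += (curves[a][t] > curves[b][t]) == final_order
            total += 1
    return agree / total if total else 1.0


def keep_task(curves, chance, min_mono=0.8, min_snr=3.0, min_order=0.8):
    """Retain a task only if every model curve passes every criterion.
    All thresholds here are hypothetical placeholders."""
    per_model_ok = all(
        monotonicity(s) >= min_mono
        and signal_to_noise(s) >= min_snr
        and non_random(s, chance)
        for s in curves.values()
    )
    return per_model_ok and ordering_consistency(curves) >= min_order


# Made-up learning curves (accuracy per checkpoint) for a 4-way multiple-choice task.
example_curves = {
    "model_a": [0.26, 0.30, 0.35, 0.41, 0.44],
    "model_b": [0.25, 0.28, 0.31, 0.36, 0.40],
}
print(keep_task(example_curves, chance=0.25))
```

Requiring all criteria at once, as the abstract describes, means a task whose scores stay near chance or whose model ranking flips between checkpoints is dropped even if it looks acceptable on a single measure.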
Similar Papers
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Computation and Language
Tests computers on global money news.
XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning
Computation and Language
Tests computers on hard money problems.
CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance
Computational Engineering, Finance, and Science
Tests if AI follows money rules safely.