FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Published: December 15, 2025 | arXiv ID: 2512.13330v1

By: Joona Kytöniemi, Jousia Piha, Akseli Reunamo and more

Potential Business Impact:

Tests how well computers understand Finnish.

Business Areas:

A/B Testing, Data and Analytics

We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
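The task-selection step in the abstract can be illustrated with a small sketch. The abstract names the criteria but does not define them, so everything below is an assumption made for illustration: monotonicity is taken as the Spearman rank correlation between training step and score, signal-to-noise as the mean of the last few checkpoint scores divided by their standard deviation, and non-random performance as a final mean above the chance baseline. The function names and thresholds are hypothetical, and model ordering consistency is left out for brevity. This is not the authors' implementation.

```python
from statistics import mean, stdev


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def monotonicity(steps, scores):
    """Assumed definition: Spearman correlation of step vs. score."""
    return pearson(ranks(steps), ranks(scores))


def signal_to_noise(scores, tail=3):
    """Assumed definition: mean / std over the last `tail` checkpoints (tail >= 2)."""
    window = scores[-tail:]
    spread = stdev(window)
    return float("inf") if spread == 0 else mean(window) / spread


def above_random(scores, random_baseline, tail=3):
    """Assumed definition: late-training mean score exceeds chance level."""
    return mean(scores[-tail:]) > random_baseline


def keep_task(steps, scores, random_baseline, min_mono=0.5, min_snr=3.0):
    """Retain a task only if every criterion holds (thresholds are hypothetical)."""
    return (
        monotonicity(steps, scores) >= min_mono
        and signal_to_noise(scores) >= min_snr
        and above_random(scores, random_baseline)
    )


# Illustrative learning curves for a four-way multiple-choice task (chance = 0.25).
steps = [1, 2, 3, 4, 5]
improving = [0.25, 0.27, 0.30, 0.34, 0.35]
flat = [0.25, 0.24, 0.26, 0.25, 0.24]
print(keep_task(steps, improving, random_baseline=0.25))
print(keep_task(steps, flat, random_baseline=0.25))
```

With these made-up curves, the steadily improving task passes all three checks while the flat one fails. In practice the scores would come from evaluating each pretraining checkpoint on the task.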

Country of Origin

🇫🇮 Finland

Page Count

46 pages
