Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play
By: Lucia Cipolina-Kun, Marianna Nezhurina, Jenia Jitsev
Potential Business Impact:
Tests how intelligently AI plays games.
The Game Reasoning Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in Google's OpenSpiel library. By wrapping multiple board and matrix games and supporting different agent types, the framework enables systematic comparisons between LLM-based agents and other agents (random, heuristic, reinforcement learning agents, etc.) across a variety of game scenarios. It integrates API access to hosted models via liteLLM, local model deployment via vLLM, and distributed execution through Ray. This paper summarises the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of LLM reasoning and game-theoretic behaviour.
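The abstract describes a pipeline in which an LLM-backed agent is queried for moves in an OpenSpiel game and compared against baseline agents. The sketch below illustrates that general pattern using OpenSpiel's Python bindings (pyspiel) and liteLLM; it is a minimal illustration, not the repository's actual API, and the prompt format, model name, and function names are assumptions.

```python
# Minimal sketch (not the Game Reasoning Arena API): pit an LLM-backed agent
# against a random baseline in an OpenSpiel game, asking the model for a legal action.
import random

import pyspiel                   # Google OpenSpiel Python bindings
from litellm import completion   # unified API access to hosted LLMs


def llm_choose_action(state, model="gpt-4o-mini"):
    """Ask the LLM to pick one legal action; fall back to a random action on failure."""
    legal = state.legal_actions()
    prompt = (
        "You are playing a board game.\n"
        f"Current state:\n{state}\n"
        f"Legal actions (integers): {legal}\n"
        "Reply with a single legal action number."
    )
    try:
        resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
        action = int(resp.choices[0].message.content.strip())
        return action if action in legal else random.choice(legal)
    except Exception:
        return random.choice(legal)


def play_episode(game_name="tic_tac_toe", llm_player=0):
    """Play one episode: the LLM controls one seat, a random agent the other(s)."""
    game = pyspiel.load_game(game_name)
    state = game.new_initial_state()
    while not state.is_terminal():
        if state.is_chance_node():
            actions, probs = zip(*state.chance_outcomes())
            state.apply_action(random.choices(actions, probs)[0])
        elif state.current_player() == llm_player:
            state.apply_action(llm_choose_action(state))
        else:
            state.apply_action(random.choice(state.legal_actions()))
    return state.returns()  # final score for each player


if __name__ == "__main__":
    print(play_episode())
```

Running many such episodes across games and agent pairings is the kind of systematic comparison the framework automates, with liteLLM/vLLM handling model access and Ray distributing the episodes.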
Similar Papers
Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play
Artificial Intelligence
Tests how smart AI plays games to improve it.
Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play
Artificial Intelligence
Tests AI smarts with board games.
LongReasonArena: A Long Reasoning Benchmark for Large Language Models
Computation and Language
Tests if computers can think through long problems.