LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts
By: Junhao Chen, Jingbo Sun, Xiang Li, and more
Potential Business Impact:
Tests how strategically AI models reason and behave in games.
As large language models (LLMs) advance across diverse tasks, the need for comprehensive evaluation beyond single metrics becomes increasingly important. To fully assess LLM intelligence, it is crucial to examine their interactive dynamics and strategic behaviors. We present LLMsPark, a game theory-based evaluation platform that measures LLMs' decision-making strategies and social behaviors in classic game-theoretic settings, providing a multi-agent environment to explore strategic depth. Our system cross-evaluates 15 leading LLMs (both commercial and open-source) using leaderboard rankings and scoring mechanisms. Higher scores reflect stronger reasoning and strategic capabilities, revealing distinct behavioral patterns and performance differences across models. This work introduces a novel perspective for evaluating LLMs' strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios. The benchmark and rankings are publicly available at https://llmsparks.github.io/.
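As a rough illustration of the cross-evaluation idea described in the abstract, the sketch below runs a round-robin tournament in a classic game-theoretic setting (an iterated Prisoner's Dilemma), accumulates scores, and ranks the players on a leaderboard. The strategy functions, model names, payoff values, and round counts are placeholder assumptions standing in for real LLM agents, not details taken from LLMsPark.

```python
# Minimal sketch of a cross-evaluation loop: every "model" plays every other
# model in an iterated Prisoner's Dilemma, scores accumulate, and the models
# are ranked by total score. Strategies here are stand-ins for LLM calls.
from itertools import combinations

# Payoff matrix keyed by (my_move, their_move); "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def always_cooperate(history):
    return "C"

def always_defect(history):
    return "D"

def tit_for_tat(history):
    # Copy the opponent's previous move; cooperate on the first round.
    return history[-1][1] if history else "C"

# Hypothetical model names mapped to placeholder strategies.
MODELS = {
    "model_a": always_cooperate,
    "model_b": always_defect,
    "model_c": tit_for_tat,
}

def play_match(strat_a, strat_b, rounds=10):
    """Play one iterated match and return both players' totals."""
    hist_a, hist_b = [], []  # each entry: (own_move, opponent_move)
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strat_a(hist_a), strat_b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b

def cross_evaluate(models):
    """Round-robin every pair of models and rank them by total score."""
    totals = {name: 0 for name in models}
    for name_a, name_b in combinations(models, 2):
        score_a, score_b = play_match(models[name_a], models[name_b])
        totals[name_a] += score_a
        totals[name_b] += score_b
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for rank, (name, score) in enumerate(cross_evaluate(MODELS), start=1):
        print(f"{rank}. {name}: {score}")
```

In a real benchmark the strategy functions would be replaced by prompted LLM agents and the single game by the platform's full set of game-theoretic settings; the round-robin accumulation and ranking step is the part this sketch is meant to convey.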
Similar Papers
Who is a Better Player: LLM against LLM
Artificial Intelligence
Tests AI's smartness by playing board games.
Beyond Next Word Prediction: Developing Comprehensive Evaluation Frameworks for measuring LLM performance on real world applications
Computation and Language
Tests AI on many tasks, not just one.
Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play
Artificial Intelligence
Tests AI smarts with board games.