LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts
By: Junhao Chen, Jingbo Sun, Xiang Li, and more
Potential Business Impact:
Tests how strategically AI models reason and behave in games.
As large language models (LLMs) advance across diverse tasks, the need for comprehensive evaluation beyond single metrics becomes increasingly important. To fully assess LLM intelligence, it is crucial to examine their interactive dynamics and strategic behaviors. We present LLMsPark, a game theory-based evaluation platform that measures LLMs' decision-making strategies and social behaviors in classic game-theoretic settings, providing a multi-agent environment to explore strategic depth. Our system cross-evaluates 15 leading LLMs (both commercial and open-source) using leaderboard rankings and scoring mechanisms. Higher scores reflect stronger reasoning and strategic capabilities, revealing distinct behavioral patterns and performance differences across models. This work introduces a novel perspective for evaluating LLMs' strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios. The benchmark and rankings are publicly available at https://llmsparks.github.io/.
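As a rough illustration of the cross-evaluation idea described in the abstract, the sketch below runs a round-robin tournament in a classic game-theoretic setting (an iterated Prisoner's Dilemma), accumulates scores, and ranks the players on a leaderboard. The strategy functions, model names, payoff values, and round counts are placeholder assumptions standing in for real LLM agents, not details taken from LLMsPark.

```python
# Minimal sketch of a cross-evaluation loop: every "model" plays every other
# model in an iterated Prisoner's Dilemma, scores accumulate, and the models
# are ranked by total score. Strategies here are stand-ins for LLM calls.
from itertools import combinations

# Payoff matrix keyed by (my_move, their_move); "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def always_cooperate(history):
    return "C"

def always_defect(history):
    return "D"

def tit_for_tat(history):
    # Copy the opponent's previous move; cooperate on the first round.
    return history[-1][1] if history else "C"

# Hypothetical model names mapped to placeholder strategies.
MODELS = {
    "model_a": always_cooperate,
    "model_b": always_defect,
    "model_c": tit_for_tat,
}

def play_match(strat_a, strat_b, rounds=10):
    """Play one iterated match and return both players' totals."""
    hist_a, hist_b = [], []  # each entry: (own_move, opponent_move)
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strat_a(hist_a), strat_b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b

def cross_evaluate(models):
    """Round-robin every pair of models and rank them by total score."""
    totals = {name: 0 for name in models}
    for name_a, name_b in combinations(models, 2):
        score_a, score_b = play_match(models[name_a], models[name_b])
        totals[name_a] += score_a
        totals[name_b] += score_b
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for rank, (name, score) in enumerate(cross_evaluate(MODELS), start=1):
        print(f"{rank}. {name}: {score}")
```

In a real benchmark the strategy functions would be replaced by prompted LLM agents and the single game by the platform's full set of game-theoretic settings; the round-robin accumulation and ranking step is the part this sketch is meant to convey.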
Similar Papers
Who is a Better Player: LLM against LLM
Artificial Intelligence
Tests AI's smartness by playing board games.
Beyond Next Word Prediction: Developing Comprehensive Evaluation Frameworks for measuring LLM performance on real world applications
Computation and Language
Tests AI on many tasks, not just one.
Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play
Artificial Intelligence
Tests AI smarts with board games.