Who is a Better Player: LLM against LLM
By: Yingjie Zhou, Jiezhang Cao, Farong Wen, and more
Potential Business Impact:
Tests how smart AI models are by having them play board games against each other.
Adversarial board games, as a paradigmatic domain of strategic reasoning and intelligence, have long served both as a popular competitive activity and as a benchmark for evaluating artificial intelligence (AI) systems. Building on this foundation, we propose an adversarial benchmarking framework that assesses the comprehensive performance of Large Language Models (LLMs) through board-game competition, compensating for the data-dependency limitation of mainstream Question-and-Answer (Q&A) based benchmarks. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players. The platform employs both the Elo rating system and a novel Performance Loop Graph (PLG) to quantitatively evaluate the technical capabilities of LLMs, while also capturing a Positive Sentiment Score (PSS) throughout gameplay to assess mental fitness. The evaluation is structured as a round-robin tournament, enabling systematic comparison across players. Experimental results indicate that, despite technical differences, most LLMs remain optimistic about both winning and losing, demonstrating greater adaptability to high-stress adversarial environments than humans. On the other hand, the complex relationships between cyclic wins and losses in PLGs expose the instability of LLMs' skill play during games, warranting further explanation and exploration.
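The abstract does not specify the platform's Elo parameters, but the rating-and-tournament setup it describes could look roughly like the following minimal sketch. The K-factor of 32, the initial rating of 1000, and the `play_game` callback are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of Elo updates over a round-robin tournament between LLM players.
# Assumptions (not from the paper): standard Elo formula, K-factor = 32,
# initial rating 1000; a game score is 1 (win), 0.5 (draw), or 0 (loss).

from itertools import combinations

K = 32
INITIAL_RATING = 1000.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float) -> tuple[float, float]:
    """Return updated ratings for both players after one game."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + K * (score_a - exp_a)
    new_b = rating_b + K * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def round_robin(players: list[str], play_game) -> dict[str, float]:
    """Play every pairing once; play_game(a, b) returns the score for player a."""
    ratings = {p: INITIAL_RATING for p in players}
    for a, b in combinations(players, 2):
        score_a = play_game(a, b)  # e.g. run one board game between two LLMs
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
    return ratings
```

In a full tournament, `round_robin` would be repeated per game type and the resulting ratings aggregated alongside the paper's PLG and PSS measures; that aggregation step is not described in the abstract and is omitted here.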
Similar Papers
LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess
Artificial Intelligence
Tests how well AI plays and understands chess.
A Multi-Agent Pokemon Tournament for Evaluating Strategic Reasoning of Large Language Models
Artificial Intelligence
Lets computers play Pokemon battles like humans.
JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
Artificial Intelligence
Small AI models can now judge answers better.