SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks
By: Pengbo Shen, Yaqing Wang, Ni Mu, and more
Potential Business Impact:
Helps AI learn to play complex games better.
Evaluating large language models (LLMs) in complex decision-making is essential for advancing AI's capacity for strategic planning and real-time adaptation. However, existing benchmarks for tasks like StarCraft II fail to capture the game's full complexity, including its complete game context, diverse action spaces, and all playable races. To address this gap, we present SC2Arena, a benchmark that fully supports all playable races and low-level action spaces, and optimizes text-based observations to tackle spatial reasoning challenges. Complementing this, we introduce StarEvolve, a hierarchical framework that integrates strategic planning with tactical execution, featuring iterative self-correction and continuous improvement via fine-tuning on high-quality gameplay data. Its key components include a Planner-Executor-Verifier structure that decomposes gameplay and a scoring system that selects high-quality training samples. Comprehensive analysis using SC2Arena yields insights into developing generalist agents that were not attainable with previous benchmarks. Experimental results also demonstrate that StarEvolve achieves superior performance in strategic planning. Our code, environment, and algorithms are publicly available.
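The abstract describes a Planner-Executor-Verifier loop in which a scoring step filters gameplay trajectories before they are used for fine-tuning. As a rough illustration only, the sketch below reconstructs that control flow in Python; every name here (Planner, Executor, Verifier, self_improve_step, threshold) is a hypothetical stand-in, since the abstract does not specify the actual interfaces.

```python
# Minimal sketch of a Planner-Executor-Verifier loop with score-based
# sample filtering, reconstructed from the abstract's description.
# All class and function names are hypothetical stubs.
from dataclasses import dataclass


@dataclass
class Sample:
    state: str          # text-based observation of the game state
    plan: str           # high-level strategy produced by the planner
    actions: list[str]  # low-level actions produced by the executor
    score: float = 0.0  # quality score assigned by the verifier


class Planner:
    def propose(self, state: str) -> str:
        # In the real system this would be an LLM call; stubbed here.
        return f"expand economy given: {state}"


class Executor:
    def act(self, state: str, plan: str) -> list[str]:
        # Translate the strategic plan into low-level actions (stubbed).
        return ["build_worker", "build_supply"]


class Verifier:
    def score(self, sample: Sample) -> float:
        # Score the trajectory; a real verifier might weigh the game
        # outcome, resource curves, and plan-action consistency.
        return 1.0 if sample.actions else 0.0


def self_improve_step(states: list[str], threshold: float = 0.5) -> list[Sample]:
    """One plan-execute-verify pass; keeps high-scoring trajectories
    as candidate fine-tuning data."""
    planner, executor, verifier = Planner(), Executor(), Verifier()
    kept = []
    for state in states:
        plan = planner.propose(state)
        actions = executor.act(state, plan)
        sample = Sample(state, plan, actions)
        sample.score = verifier.score(sample)
        if sample.score >= threshold:  # scoring system selects samples
            kept.append(sample)        # ...later used for fine-tuning
    return kept


if __name__ == "__main__":
    data = self_improve_step(["early game, 12 workers, no army"])
    print(f"kept {len(data)} high-quality samples")
```

The design point worth noting is the separation of concerns: the planner and executor can be improved independently, while the verifier's threshold controls how aggressively low-quality trajectories are excluded from the fine-tuning set.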
Similar Papers
SC2Tools: StarCraft II Toolset and Dataset API
Software Engineering
Helps build large game datasets for AI research.
Adaptive Command: Real-Time Policy Adjustment via Language Models in StarCraft II
Human-Computer Interaction
Helps players win video games using natural language commands.
Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play
Artificial Intelligence
Tests AI reasoning with board games.