Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
By: Alexander Duffy, Samuel J. Paech, Ishana Shastri, and more
Potential Business Impact:
Lets computers play complex strategy games.
We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning because of the high complexity and information density of Diplomacy's game state. Combined with the high variance of matches, these factors made Diplomacy prohibitively expensive to study. In this work, we used data-driven iteration to optimize a textual game-state representation such that a 24B model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding that the larger models perform best but the smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating on and analyzing key moments of a game in depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally in widely used LLMs. Our code is available in the supplement and will be open sourced.
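To make the idea of a "textual game-state representation" concrete, here is a minimal sketch of how a Diplomacy board position might be flattened into compact prompt text for an LLM. The function name, field layout, and formatting scheme are illustrative assumptions, not the paper's actual representation:

```python
# Hypothetical sketch: serialize a Diplomacy position into prompt text.
# The format below (one line per power, units and supply centers in
# brackets) is an assumption for illustration, not the paper's scheme.

def render_state(season: str, year: int, units: dict, centers: dict) -> str:
    """Flatten a Diplomacy board position into compact prompt text."""
    lines = [f"PHASE: {season} {year}"]
    for power in sorted(units):
        unit_str = ", ".join(units[power])
        sc_str = ", ".join(sorted(centers.get(power, [])))
        lines.append(f"{power}: units [{unit_str}] | centers [{sc_str}]")
    return "\n".join(lines)

state_text = render_state(
    "Spring", 1901,
    units={"FRANCE": ["A PAR", "A MAR", "F BRE"],
           "GERMANY": ["A BER", "A MUN", "F KIE"]},
    centers={"FRANCE": ["PAR", "MAR", "BRE"],
             "GERMANY": ["BER", "MUN", "KIE"]},
)
print(state_text)
```

The design pressure the paper describes is exactly here: the denser and less ambiguous this text is, the more reliably a small (e.g. 24B) model can parse the position and emit valid orders.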
Similar Papers
DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy
Artificial Intelligence
AI learns to play complex strategy games better.
Tracking World States with Language Models: State-Based Evaluation Using Chess
Artificial Intelligence
Tests if computers understand game rules deeply.
Measuring Fine-Grained Negotiation Tactics of Humans and LLMs in Diplomacy
Computers and Society
Teaches computers to negotiate like people.