M3-BENCH: Process-Aware Evaluation of LLM Agents' Social Behaviors in Mixed-Motive Games
By: Sixiong Xie, Zhuofan Shi, Haiyang Shen, and more
As the capabilities of large language model (LLM) agents continue to advance, their increasingly sophisticated social behaviors, such as cooperation, deception, and collusion, call for systematic evaluation. However, existing benchmarks often emphasize a single capability dimension or rely solely on behavioral outcomes, overlooking the rich process information contained in agents' decision-making reasoning and communicative interactions. To address this gap, we propose M3-Bench, a multi-stage benchmark built on mixed-motive games, together with a process-aware evaluation framework that jointly analyzes agents across three modules: BTA (Behavioral Trajectory Analysis), RPA (Reasoning Process Analysis), and CCA (Communication Content Analysis). Furthermore, we integrate the Big Five personality model and Social Exchange Theory to aggregate this multi-dimensional evidence into interpretable social behavior portraits, characterizing agents' personality traits and capability profiles beyond simple task scores or outcome-based metrics. Experimental results show that M3-Bench reliably distinguishes social behavior competencies across models, and it reveals that some models achieve seemingly reasonable behavioral outcomes while exhibiting pronounced inconsistencies in their reasoning and communication.
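To make the aggregation idea concrete, the following is a minimal sketch, not the authors' implementation, of how per-episode evidence from the three modules named in the abstract (BTA, RPA, CCA) could be averaged and mapped onto Big Five-style trait axes. All score dimensions, weights, and the trait mapping are hypothetical illustrations.

```python
# Minimal sketch of process-aware score aggregation into a behavior portrait.
# Module names (BTA, RPA, CCA) come from the abstract; every dimension key,
# weight, and trait mapping below is a hypothetical placeholder.
from dataclasses import dataclass
from statistics import mean


@dataclass
class ModuleScores:
    """Normalized [0, 1] scores produced for one game episode."""
    bta: dict[str, float]  # Behavioral Trajectory Analysis, e.g. {"cooperation": 0.8}
    rpa: dict[str, float]  # Reasoning Process Analysis, e.g. {"consistency": 0.6}
    cca: dict[str, float]  # Communication Content Analysis, e.g. {"honesty": 0.7}


def aggregate_portrait(episodes: list[ModuleScores]) -> dict[str, float]:
    """Average each dimension over episodes, then map to illustrative trait axes."""
    def avg(module: str, key: str) -> float:
        return mean(getattr(ep, module).get(key, 0.0) for ep in episodes)

    # Hypothetical mapping from module dimensions to Big Five-style traits.
    return {
        "agreeableness": 0.5 * avg("bta", "cooperation") + 0.5 * avg("cca", "honesty"),
        "conscientiousness": avg("rpa", "consistency"),
    }


if __name__ == "__main__":
    episodes = [
        ModuleScores(bta={"cooperation": 0.8}, rpa={"consistency": 0.6}, cca={"honesty": 0.7}),
        ModuleScores(bta={"cooperation": 0.4}, rpa={"consistency": 0.9}, cca={"honesty": 0.5}),
    ]
    print(aggregate_portrait(episodes))
```

A pipeline of this shape also makes the paper's headline finding easy to surface: an agent whose BTA scores look cooperative but whose RPA and CCA scores diverge would show up as an outcome/process inconsistency in the portrait.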