TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models
By: Yiran Zhang, Mo Wang, Xiaoyang Li, and more
Potential Business Impact:
Tests computers' ability to solve puzzles over time.
Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by the "Turing Machine" board game. In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps, capabilities that are underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 81.5% accuracy in Classic mode, but performance drops to 17.8% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.
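To make the interaction protocol concrete, the sketch below illustrates the guess-feedback loop the abstract describes: a hidden code constrained by hidden rules, per-rule structured feedback each round, and a solver that must integrate clues across rounds. This is a minimal illustration only; the class and function names (Episode, run_episode, exhaustive_guesser) are hypothetical and are not taken from the TurnBench release.

```python
# Illustrative sketch of a TurnBench-style episode loop (hypothetical API,
# not the benchmark's actual code): guess -> structured feedback -> repeat.
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass
class Episode:
    """One code-breaking episode: a hidden code plus per-rule verifiers."""
    hidden_code: Tuple[int, ...]
    verifiers: List[Callable[[Tuple[int, ...]], bool]]  # each encodes one hidden rule

    def feedback(self, guess: Tuple[int, ...]) -> List[bool]:
        # Structured feedback: one pass/fail signal per hidden rule.
        return [check(guess) for check in self.verifiers]


def run_episode(episode: Episode, propose_guess, max_rounds: int = 10) -> bool:
    """Drive the multi-turn loop: each round the solver proposes a guess given
    all past clues, receives per-rule feedback, and tries again."""
    history: List[Tuple[Tuple[int, ...], List[bool]]] = []
    for _ in range(max_rounds):
        guess = propose_guess(history)          # next guess, conditioned on past clues
        clues = episode.feedback(guess)
        history.append((guess, clues))
        if guess == episode.hidden_code:        # episode solved
            return True
    return False


def exhaustive_guesser(history):
    # Naive baseline that ignores feedback and simply enumerates all 3-digit codes.
    candidates = list(product(range(5), repeat=3))
    return candidates[len(history) % len(candidates)]


# Toy usage: a 3-digit code constrained by two hidden arithmetic rules.
episode = Episode(
    hidden_code=(2, 4, 1),
    verifiers=[
        lambda g: g[0] < g[1],        # rule 1: first digit smaller than second
        lambda g: sum(g) % 2 == 1,    # rule 2: digit sum is odd
    ],
)
print("solved:", run_episode(episode, exhaustive_guesser, max_rounds=125))
```

A stronger solver would use the per-rule feedback to prune the candidate space each round, which is exactly the kind of cross-turn clue integration the benchmark is designed to measure.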
Similar Papers
StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns
Computation and Language
Tests how well AI remembers stories and makes choices.
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Computation and Language
Smart computers fail at simple games.
Computational Reasoning of Large Language Models
Computation and Language
Tests if AI can follow rules and solve problems.