S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
By: Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, and more
Potential Business Impact:
Tests if AI can think fast, not just slow.
We introduce S1-Bench, a novel benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 capabilities, and an appropriate benchmark for evaluating LRMs' system 1 thinking has been lacking. To fill this gap, S1-Bench provides a suite of simple, diverse, and natural questions across multiple domains and languages, specifically designed to assess LRMs' performance on questions better suited to system 1. We conduct extensive evaluations across 28 LRMs, revealing their inefficiency, inadequate accuracy, and limited robustness when handling simple questions. We also observe a mismatch between the models' perception of question difficulty and the length of their generated responses. Overall, this work paves the way toward dual-system compatibility in the development of LRMs.
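The kind of evaluation the abstract describes, posing simple questions and recording both correctness and how much text a reasoning model produces before answering, can be illustrated with a minimal sketch. This is not the S1-Bench authors' code: the model name, the sample questions, and the naive string-match judging below are all illustrative assumptions.

```python
# Minimal sketch of probing a reasoning model's "system 1" behavior:
# ask simple questions, then record accuracy and output length.
# Model name, questions, and the answer-matching rule are assumptions,
# not details taken from the S1-Bench paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed example LRM

simple_questions = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "Which number is larger, 7 or 4?", "answer": "7"},
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

results = []
for item in simple_questions:
    inputs = tokenizer(item["question"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=512)
    # Count only newly generated tokens as a rough proxy for "thinking length".
    generated = output_ids[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(generated, skip_special_tokens=True)
    results.append({
        "question": item["question"],
        "correct": item["answer"] in text,   # naive string match; real judging is stricter
        "output_tokens": len(generated),
    })

for r in results:
    print(r)
```

Under this setup, a model that answers correctly but spends hundreds of tokens deliberating on a trivial question would exhibit exactly the inefficiency the paper reports.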
Similar Papers
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
CV and Pattern Recognition
Tests AI's smart thinking on hard problems.
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Computation and Language
Smart computers fail at simple games.
RiddleBench: A New Generative Reasoning Benchmark for LLMs
Computation and Language
Tests AI's smart thinking, finds it struggles.