Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
By: Sai Teja Reddy Adapala
Potential Business Impact:
AI becomes measurably less accurate when its input is cluttered with irrelevant information.
The scaling of Large Language Models (LLMs) has exposed a critical gap between their strong performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark that systematically manipulates these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variation across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.
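To make the context-saturation manipulation and the reported slope concrete, the following is a minimal Python sketch of how a saturation condition might be constructed and how a per-%-load coefficient like $\beta$ could be estimated. The function `saturate_context`, the two-hop example item, and every number below are illustrative assumptions, not the released ICE materials or the paper's data.

```python
import random
import numpy as np

def saturate_context(facts, distractors, load_pct, seed=0):
    """Interleave task-relevant facts with enough task-irrelevant
    distractor sentences that load_pct percent of the context is noise.
    Hypothetical helper, not part of the ICE release."""
    if not 0 <= load_pct < 100:
        raise ValueError("load_pct must be in [0, 100)")
    rng = random.Random(seed)
    # Choose n_dis so that distractors / (facts + distractors) == load_pct / 100.
    n_dis = round(len(facts) * load_pct / (100 - load_pct))
    sentences = facts + rng.sample(distractors, min(n_dis, len(distractors)))
    rng.shuffle(sentences)
    return " ".join(sentences)

# Hypothetical two-hop item: both facts are needed to answer the question.
facts = [
    "Ada manages the Berlin office.",
    "The Berlin office ships the compiler.",
]
distractors = [f"Employee {i} attended the quarterly offsite." for i in range(50)]
prompt = (saturate_context(facts, distractors, load_pct=60)
          + "\nQuestion: Who manages the office that ships the compiler?")

# Estimating the per-%-load slope from per-condition mean accuracies.
# These accuracies are placeholders chosen to illustrate a slope of
# -0.003 (about 0.3 accuracy percentage points lost per 1% of load).
loads = np.array([0, 20, 40, 60, 80])
accuracy = np.array([0.85, 0.79, 0.73, 0.67, 0.61])
beta, intercept = np.polyfit(loads, accuracy, 1)
print(f"beta = {beta:.4f} accuracy per % load")
```

On this toy fit, $\beta = -0.003$ implies roughly a 30-percentage-point accuracy drop at full saturation (from 0.85 to about 0.55 at 100% load), which matches the scale of degradation the abstract reports for Gemini-2.0-Flash-001; the ordinary least-squares line here is only a stand-in for whatever regression the paper actually fits.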
Similar Papers
Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models
Machine Learning (CS)
Makes AI think smarter, using less energy.
United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory
Artificial Intelligence
Helps AI solve harder problems by working together.
CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density
Computation and Language
Tests AI's thinking power with harder puzzles.