Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
By: Yu Huang, Zixin Wen, Aarti Singh, and more
Potential Business Impact:
AI learns to solve harder problems with longer thinking.
The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers trained with gradient descent on synthetic state-tracking tasks. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the structure of the state-tracking task in long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a recursive self-training scheme can progressively extend the range of solvable problem lengths. To our knowledge, we provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT, going significantly beyond prior art confined to $\mathsf{TC}^0$, unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails. Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.
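The abstract does not spell out the synthetic task, but $\mathsf{NC}^1$-complete state-tracking problems are canonically word problems over non-solvable groups such as $S_5$ (Barrington's theorem). As a hedged illustration only, and assuming the task is permutation composition in $S_5$ (the exact task, vocabulary, and CoT format in the paper may differ), the sketch below generates input sequences together with a chain-of-thought that tracks the running composition, i.e. the state, one step at a time.

```python
# Hypothetical sketch of a state-tracking task with chain-of-thought traces.
# Assumption: the task is composing random permutations in S_5 (an NC^1-complete
# word problem); the actual task and CoT format used in the paper may differ.
import random
from itertools import permutations

S5 = list(permutations(range(5)))  # all 120 elements of S_5

def compose(p, q):
    """Apply p after q: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(5))

def make_example(length, rng=random):
    """Return (input permutations, CoT trace of intermediate states, final state)."""
    seq = [rng.choice(S5) for _ in range(length)]
    state = tuple(range(5))          # identity permutation as the initial state
    cot = []
    for g in seq:
        state = compose(g, state)    # update the tracked state one step at a time
        cot.append(state)            # each CoT step writes out the current state
    return seq, cot, state

if __name__ == "__main__":
    seq, cot, final = make_example(length=6, rng=random.Random(0))
    for t, (g, s) in enumerate(zip(seq, cot), start=1):
        print(f"step {t}: apply {g} -> state {s}")
    print("answer:", final)
```

Length generalization in this setting would mean training on short sequences (small `length`) and evaluating whether the model's CoT still tracks the state correctly on much longer sequences.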
Similar Papers
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Machine Learning (CS)
Helps computers solve problems faster by thinking in parallel.
Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent
Machine Learning (CS)
Teaches computers to solve problems step-by-step.
Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization
Computation and Language
Teaches computers to think step-by-step to solve problems.