ViT$^3$: Unlocking Test-Time Training in Vision

Published: December 1, 2025 | arXiv ID: 2512.01643v1

By: Dongchen Han, Yining Li, Tianyu Li and more

Potential Business Impact:

Makes computers understand pictures faster and better.

Business Areas:

Image Recognition, Data and Analytics, Software

Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
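To make the "attention as online learning" idea concrete, here is a minimal NumPy sketch of a toy TTT-style layer. It is not the ViT$^3$ architecture or the authors' code. It assumes the simplest possible inner model (a single linear map `W`), a squared-error inner loss, and full-batch gradient descent over all tokens at once. The function name `ttt_linear_layer` and the parameters `lr` and `steps` are hypothetical, chosen only for illustration.

```python
import numpy as np


def ttt_linear_layer(queries, keys, values, lr=0.1, steps=1):
    """Toy TTT-style layer (illustrative assumption, not the paper's design).

    Fits a linear inner model W so that keys @ W approximates values,
    then applies the fitted model to the queries.

    queries: (n, d_k), keys: (n, d_k), values: (n, d_v)
    """
    n, d_k = keys.shape
    d_v = values.shape[1]

    # Inner model starts from zero; it is rebuilt for every input sequence.
    W = np.zeros((d_k, d_v))

    for _ in range(steps):
        pred = keys @ W
        # Gradient of 0.5 * mean squared error between pred and values.
        grad = keys.T @ (pred - values) / n
        W -= lr * grad

    # Each step costs O(n * d_k * d_v), which is linear in the token count n.
    return queries @ W


# Usage: 16 tokens (e.g. image patches), key dim 4, value dim 3.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 4))
k = rng.normal(size=(16, 4))
v = rng.normal(size=(16, 3))
out = ttt_linear_layer(q, k, v, lr=0.1, steps=1)
print(out.shape)
```

With a zero initialization and a single step, the output of this sketch reduces to `lr * q @ (k.T @ v) / n`, which is unnormalized linear attention. Richer inner models or more inner training steps move away from that special case, which is the design space the abstract refers to.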

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

CV and Pattern Recognition

Makes AI better at understanding videos by looking closer.

25 Sep 2025

Test time training enhances in-context learning of nonlinear functions

Machine Learning (Stat)

Helps AI learn new things faster, even when they change.

30 Sep 2025

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

CV and Pattern Recognition

Makes AI videos follow instructions better.

9 Oct 2025

Page Count

13 pages
