Next-Embedding Prediction Makes Strong Vision Learners

Published: December 18, 2025 | arXiv ID: 2512.16922v1

By: Sihan Xu, Ziqiao Ma, Wenhao Chai and more

Potential Business Impact:

Teaches computers to see by predicting picture parts.

Business Areas:

Image Recognition, Data and Analytics, Software

Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next-embedding prediction as its sole learning objective is effective: it needs no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
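
To make the objective in the abstract concrete, here is a minimal NumPy sketch of next-embedding prediction with a causal mask. It is an illustration under stated assumptions, not the authors' implementation: the single-head attention layer, the negative cosine similarity loss, the random weights, and all names (`causal_self_attention`, `next_embedding_loss`, `patch_embeddings`) are hypothetical. The abstract does not specify the loss or the architecture details. NumPy has no autograd, so the stop gradient appears only as a comment at the point where an autograd framework would detach the targets.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where position t attends only to positions <= t."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    t = x.shape[0]
    # Causal mask: True strictly above the diagonal marks "future" positions.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row (the diagonal is never masked).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def next_embedding_loss(embeddings, predictions):
    """Mean negative cosine similarity between prediction t and embedding t+1.

    The cosine loss is an assumption made for this sketch.
    """
    # In an autograd framework the targets would be detached here (stop gradient).
    targets = embeddings[1:]
    preds = predictions[:-1]
    preds = preds / np.linalg.norm(preds, axis=-1, keepdims=True)
    targets = targets / np.linalg.norm(targets, axis=-1, keepdims=True)
    return float(-(preds * targets).sum(axis=-1).mean())


# Toy data: 16 patch embeddings of width 32, standing in for an image's patches.
rng = np.random.default_rng(0)
num_patches, dim = 16, 32
patch_embeddings = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

predictions = causal_self_attention(patch_embeddings, w_q, w_k, w_v)
loss = next_embedding_loss(patch_embeddings, predictions)
print(f"next-embedding loss: {loss:.4f}")
```

The shift between `predictions[:-1]` and `embeddings[1:]` is what makes this a next-embedding objective: the output at position t is scored against the embedding at position t+1, and the causal mask ensures that output never saw its own target.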

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

CV and Pattern Recognition

Teaches computers to understand videos like humans.

1 Dec 2025 1

89%

NEP: Autoregressive Image Editing via Next Editing Token Prediction

CV and Pattern Recognition

Changes pictures exactly where you tell it to.

8 Aug 2025 1

88%

Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?

CV and Pattern Recognition

Computers will soon draw pictures by guessing each dot.

11 Nov 2025 0

Page Count

15 pages
