
VFMF: World Modeling by Forecasting Vision Foundation Model Features

Published: December 12, 2025 | arXiv ID: 2512.11225v1

By: Gabrijel Boduljak, Yushi Lan, Christian Rupprecht and more

Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.
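The abstract's claim that deterministic regression averages over multiple plausible futures can be illustrated with a toy example. The sketch below is not the paper's method or data. It assumes a hypothetical one-dimensional "feature" whose future value is either -1 or +1 with equal probability, and compares the MSE-optimal point prediction with a single sampled future. The names `simulate_futures`, `mse`, and `distance_to_nearest_mode` are illustrative only.

```python
import random

MODES = (-1.0, 1.0)  # assumed: two equally likely futures for one scalar feature


def simulate_futures(n, seed=0):
    """Draw n futures from the assumed two-mode distribution."""
    rng = random.Random(seed)
    return [rng.choice(MODES) for _ in range(n)]


def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


def distance_to_nearest_mode(value):
    """How far a prediction is from the closest plausible future."""
    return min(abs(value - m) for m in MODES)


futures = simulate_futures(10_000)

# Deterministic regression: the constant that minimizes MSE is the sample mean.
regression_prediction = sum(futures) / len(futures)

# Generative forecasting (stand-in): return one sampled future instead.
sampled_prediction = futures[0]

print("regression prediction:", round(regression_prediction, 3))
print("regression MSE:", round(mse(regression_prediction, futures), 3))
print("regression distance to nearest mode:",
      round(distance_to_nearest_mode(regression_prediction), 3))
print("sampled distance to nearest mode:",
      distance_to_nearest_mode(sampled_prediction))
```

Under these assumptions the regression output lands between the two modes and matches neither, even though it has the lowest MSE of any constant prediction. A sampled forecast always coincides with one plausible future, which is the behavior the abstract attributes to generative forecasting.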

Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models

CV and Pattern Recognition

Combines old AI to make new, smarter AI.

20 Aug 2025 0

89%

Temporal-Guided Visual Foundation Models for Event-Based Vision

CV and Pattern Recognition

Lets cameras see better in tough conditions.

9 Nov 2025 4

