What Happens Next? Next Scene Prediction with a Unified Video Model

Published: December 15, 2025 | arXiv ID: 2512.13015v1

By: Xinjie Li, Zhimin Chen, Rui Zhao, and more

BigTech Affiliations: Amazon

Potential Business Impact:

Helps computers guess what happens next in videos.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
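The bridging idea in the abstract, where learned latent query embeddings pull information out of the comprehension model (Qwen-VL) and a connector projects it into the synthesis model's (LTX) latent space, can be sketched as a small cross-attention module. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, weight shapes, and the single-head attention are assumptions, and random matrices stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative, not taken from the paper)
d_vlm, d_latent = 64, 32   # comprehension-feature dim, video-latent dim
n_ctx, n_query = 20, 8     # context tokens from the VLM, learned query slots

# Learned latent query embeddings and connector weights (random stand-ins)
queries = rng.standard_normal((n_query, d_vlm))
W_k = rng.standard_normal((d_vlm, d_vlm)) / np.sqrt(d_vlm)
W_v = rng.standard_normal((d_vlm, d_vlm)) / np.sqrt(d_vlm)
W_out = rng.standard_normal((d_vlm, d_latent)) / np.sqrt(d_vlm)

def connector(ctx_features):
    """Cross-attend learned queries over VLM context features,
    then project the result into the video model's latent space."""
    K = ctx_features @ W_k                                   # (n_ctx, d_vlm)
    V = ctx_features @ W_v                                   # (n_ctx, d_vlm)
    attn = softmax(queries @ K.T / np.sqrt(d_vlm), axis=-1)  # (n_query, n_ctx)
    return (attn @ V) @ W_out                                # (n_query, d_latent)

ctx = rng.standard_normal((n_ctx, d_vlm))  # stand-in for Qwen-VL features
latents = connector(ctx)
print(latents.shape)  # (8, 32)
```

In a real unified model the queries and connector weights would be trained end-to-end (here, through the three stages the abstract lists), so the fixed number of query slots gives the generator a compact, learned summary of the preceding scenes regardless of context length.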

Country of Origin
🇺🇸 United States

Page Count
15 pages

Category
Computer Science:
Computer Vision and Pattern Recognition