Score: 5

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Published: December 11, 2025 | arXiv ID: 2512.10927v1

By: Yulu Gan, Ligeng Zhu, Dandan Shan, and more

BigTech Affiliations: Massachusetts Institute of Technology; Stanford University; University of California, Berkeley

Potential Business Impact:

Teaches computers to understand how objects move in videos, enabling automatic labeling of motion data at scale.

Business Areas:
Motion Capture; Media and Entertainment; Video

Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
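
The abstract describes a two-stage pipeline: detect and track objects to extract trajectories, then pass those trajectories (with video frames) to an LLM to generate motion captions and question-answer pairs. Below is a minimal sketch of that flow under stated assumptions: the `Track` structure, all function names, and the prompt format are illustrative stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch of a FoundationMotion-style auto-labeling pipeline.
# All names and formats here are assumptions for clarity, not the authors' code.

from dataclasses import dataclass


@dataclass
class Track:
    object_id: int
    label: str
    boxes: list  # per-frame (frame_idx, x, y, w, h) tuples


def detect_and_track(frames):
    """Stage 1 (stub): a real pipeline would run an object detector plus a
    multi-object tracker over the video; here we return a canned trajectory."""
    return [Track(object_id=0, label="car",
                  boxes=[(t, 10 + 5 * t, 50, 40, 20) for t in range(len(frames))])]


def trajectory_to_text(track):
    """Serialize a trajectory so a language model can reason over it."""
    start, end = track.boxes[0], track.boxes[-1]
    return (f"{track.label} #{track.object_id}: moves from x={start[1]} "
            f"to x={end[1]} over {len(track.boxes)} frames")


def build_qa_prompt(tracks):
    """Stage 2 (stub): combine serialized trajectories (and, in the real
    pipeline, sampled video frames) into an LLM prompt asking for a
    fine-grained motion caption and motion/spatial QA pairs."""
    lines = "\n".join(trajectory_to_text(t) for t in tracks)
    return ("Given these object trajectories:\n" + lines +
            "\nWrite a fine-grained motion caption and three question-answer "
            "pairs about motion direction, speed, and spatial relations.")


if __name__ == "__main__":
    frames = [None] * 8  # placeholder for decoded video frames
    tracks = detect_and_track(frames)
    print(build_qa_prompt(tracks))  # prompt to send to an LLM labeler
```

The LLM's responses to prompts like this would form the fine-tuning dataset; the paper reports using such data to fine-tune NVILA-Video-15B and Qwen2.5-7B.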

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
20 pages

Category
Computer Science:
Computer Vision and Pattern Recognition