
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Published: April 17, 2025 | arXiv ID: 2504.13122v1

By: Haojian Huang, Haodong Chen, Shengqiong Wu and more

Potential Business Impact:

Makes AI understand videos better, like people do.

Business Areas:

Image Recognition, Data and Analytics, Software

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
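For readers unfamiliar with the underlying technique, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single chosen/rejected response pair. It is not VistaDPO's hierarchical objective, which the abstract does not spell out. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, and the log-probabilities are assumed to be sequence-level sums computed elsewhere by the policy and reference models.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch only).

    All inputs are log-probabilities of a full response given the prompt
    (and, for a video model, the video). `beta` scales how strongly the
    policy is pushed away from the reference model; 0.1 is an assumed default.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss is -log(sigmoid(z)) with z the scaled margin difference.
    z = beta * (chosen_margin - rejected_margin)

    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss falls as the policy assigns relatively more probability to the chosen response than to the rejected one, measured against the reference model. When the policy and reference agree exactly, the loss equals log 2.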

Benchmarking Direct Preference Optimization for Medical Large Vision-Language Models

CV and Pattern Recognition

Makes AI better at understanding medical pictures.

25 Jan 2026 1

91%

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

CV and Pattern Recognition

Makes AI videos move better with less data.

4 Jun 2025 0

91%

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

CV and Pattern Recognition

Makes AI videos look more real and flow better.

7 Jan 2026 0


Country of Origin

🇸🇬 Singapore

Page Count

20 pages
