PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
By: Zeqing Wang, Keze Wang, Lei Zhang
Potential Business Impact:
Teaches AI to spot videos breaking physics rules.
Driven by growing model capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models understand physics and can generate physically plausible videos remains an open question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach that enables VLMs not only to detect physically implausible events but also to generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains challenging, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.
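To make the intended usage concrete, here is a minimal sketch of querying a fine-tuned VLM as a physical-plausibility detector and explainer, assuming a checkpoint loadable through Hugging Face transformers. The checkpoint name, prompt wording, and frame-sampling interface below are illustrative assumptions, not the authors' released PhyDetEx API; see the repository above for the actual interface.

```python
# Sketch: ask a fine-tuned VLM whether sampled video frames violate
# physical laws, and have it explain the violated principle.
# The checkpoint name and prompt are hypothetical placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "your-org/phydetex-vlm"  # hypothetical fine-tuned checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

PROMPT = (
    "These are frames sampled from a generated video. Does any event "
    "violate physical laws? Answer yes or no, then explain which "
    "physical principle is violated."
)

def detect_and_explain(frames: list[Image.Image]) -> str:
    """Return the VLM's plausibility verdict and textual explanation."""
    inputs = processor(images=frames, text=PROMPT, return_tensors="pt")
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

In this sketch the verdict and explanation come back as one generated string; a deployment would likely parse the yes/no answer separately from the explanation when benchmarking T2V models at scale.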
Similar Papers
PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement
CV and Pattern Recognition
Helps computers understand how things move in videos.
T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
Machine Learning (CS)
Makes computer videos follow real-world physics rules.
Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement
CV and Pattern Recognition
Makes videos follow real-world physics rules.