
T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

Published: May 8, 2025 | arXiv ID: 2505.04946v1

By: Xuyang Guo, Jiayan Huo, Zhenmei Shi and more

Potential Business Impact:

Makes videos show words correctly.

Business Areas:

Text Analytics Data and Analytics, Software

Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.

VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

Artificial Intelligence

Makes videos match stories better.

18 Feb 2025 2

92%

T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation

CV and Pattern Recognition

Tests if AI videos understand how the world works.

24 Jul 2025 0

92%

T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos

CV and Pattern Recognition

Helps check if computer-made videos match their words.

15 Jan 2025 1


Country of Origin

🇺🇸 United States

Page Count

30 pages
