T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) models exhibit severe deficiencies in rendering on-screen text—such as subtitles and mathematical formulas—failing to ensure readability, spatial stability, and cross-frame consistency. Method: We introduce T2VTextBench, the first human-evaluation benchmark dedicated to screen-text fidelity, featuring a multidimensional scoring protocol covering font rendering, semantic accuracy, spatial positioning, and temporal consistency. We design a prompt set that integrates complex textual instructions with dynamic scenes to systematically evaluate ten mainstream open-source and commercial T2V models. Contribution/Results: T2VTextBench fills a critical gap in evaluating precise on-screen text rendering. Experiments reveal consistent and significant failures across all evaluated models in mathematical formula generation, multilingual text, and long-text scenarios, exposing fundamental limitations in text-controllable video synthesis. The benchmark establishes a rigorous evaluation framework and identifies concrete directions for advancing controllable, text-accurate T2V generation.
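The summary describes a multidimensional human-scoring protocol, though the paper's exact rubric and scale are not given here. A minimal sketch of how such per-clip ratings could be aggregated—assuming a hypothetical 1–5 scale and simple averaging across the four named dimensions—might look like:

```python
from statistics import mean

# Hypothetical dimensions taken from the summary; the paper's actual
# rubric, scale, and aggregation method may differ.
DIMENSIONS = ("font_rendering", "semantic_accuracy",
              "spatial_positioning", "temporal_consistency")

def aggregate_scores(ratings):
    """Average 1-5 human ratings per dimension, then overall.

    `ratings` is a list of dicts, one per rater, mapping each
    dimension name to a 1-5 score for a single generated clip.
    """
    per_dim = {d: mean(r[d] for r in ratings) for d in DIMENSIONS}
    per_dim["overall"] = mean(per_dim[d] for d in DIMENSIONS)
    return per_dim

# Example: three raters scoring one generated clip.
clip_ratings = [
    {"font_rendering": 4, "semantic_accuracy": 5,
     "spatial_positioning": 3, "temporal_consistency": 4},
    {"font_rendering": 3, "semantic_accuracy": 4,
     "spatial_positioning": 3, "temporal_consistency": 5},
    {"font_rendering": 5, "semantic_accuracy": 5,
     "spatial_positioning": 4, "temporal_consistency": 4},
]
scores = aggregate_scores(clip_ratings)
```

Per-model leaderboard numbers would then come from averaging these per-clip scores over the full prompt set.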

📝 Abstract
Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.
Problem

Research questions and friction points this paper is trying to address.

Evaluating text-to-video models' on-screen text accuracy
Assessing temporal consistency in generated video text
Identifying gaps in textual fidelity for video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces T2VTextBench for text fidelity evaluation
Tests models with complex text and dynamic scenes
Evaluates ten state-of-the-art text-to-video systems