Video Text Preservation with Synthetic Text-Rich Videos

πŸ“… 2025-11-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Text-to-video (T2V) models suffer from pervasive issues: blurry text rendering, poor legibility of short words and phrases, structural incoherence for long prompts, and weak temporal consistency. Existing correction methods incur high computational overhead and do not integrate cleanly into T2V generation pipelines. This paper proposes a lightweight, weakly supervised fine-tuning framework: first, a text-agnostic image-to-video (I2V) model animates text-containing images synthesized by a pre-trained text-to-image (T2I) diffusion model, yielding high-fidelity synthetic video–prompt pairs; second, a pre-trained T2V model (Wan2.1) is fine-tuned exclusively on this data, with no architectural modification. The approach significantly improves short-text clarity and long-text structural fidelity while enhancing inter-frame temporal coherence, achieving superior generation quality at minimal computational cost and offering a scalable, low-overhead route to readable text in T2V generation.

πŸ“ Abstract
While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to correctly render even short phrases or words, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improving T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video–prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improvements in short-text legibility and temporal consistency, with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.
Problem

Research questions and friction points this paper is trying to address.

Improving legibility of generated text in video synthesis
Enhancing temporal consistency for text in videos
Reducing computational cost for text-rich video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic text-rich images from T2I model
Animation into videos via text-agnostic I2V
Fine-tuning T2V model with synthetic video-prompt pairs
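The three contributions above form a data pipeline, which can be sketched in minimal pseudocode. Everything here is a hypothetical illustration: `t2i_generate`, `i2v_animate`, and `finetune_t2v` are stand-ins for a pre-trained T2I diffusion model, the text-agnostic I2V model, and the Wan2.1 fine-tuning loop; the paper does not specify these interfaces, only the three-stage flow.

```python
from dataclasses import dataclass

@dataclass
class VideoPromptPair:
    prompt: str       # caption, including the exact words to be rendered
    frames: list      # animated frames derived from the text-rich image

# --- Hypothetical stand-ins for the real models (assumptions, not the paper's API) ---

def t2i_generate(prompt: str):
    """Stand-in for a pre-trained T2I diffusion model producing a text-rich image."""
    return f"<image rendering {prompt!r}>"

def i2v_animate(image, num_frames: int = 16):
    """Stand-in for the text-agnostic I2V model animating a still image."""
    return [f"{image}@t={t}" for t in range(num_frames)]

def build_synthetic_pairs(prompts):
    """Steps 1-2: synthesize text-rich images, then animate them into videos."""
    pairs = []
    for p in prompts:
        image = t2i_generate(p)        # step 1: legible text in a still image
        frames = i2v_animate(image)    # step 2: motion added without touching the text
        pairs.append(VideoPromptPair(prompt=p, frames=frames))
    return pairs

def finetune_t2v(model, pairs):
    """Step 3 (stub): fine-tune a pre-trained T2V model on the synthetic pairs only,
    with no architectural changes."""
    for pair in pairs:
        model["seen"].append(pair.prompt)   # placeholder for a gradient update
    return model

prompts = ['A neon sign that reads "OPEN"', 'A chalkboard with the word "HELLO"']
pairs = build_synthetic_pairs(prompts)
model = finetune_t2v({"name": "Wan2.1", "seen": []}, pairs)
print(len(pairs), len(model["seen"]))  # 2 2
```

The key design point the sketch captures is the division of labor: the T2I model alone is responsible for text legibility, the I2V model only supplies motion, and the T2V model inherits both properties through weak supervision rather than architectural change.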