🤖 AI Summary
Existing diffusion-based text-to-image models struggle to accurately embed text in generated images—producing spelling errors, contextual mismatches, and visual incoherence—and the field lacks comprehensive benchmarks for evaluating text fidelity. Method: We introduce TextInVision, a large-scale benchmark explicitly designed to assess visual-text embedding, driven by both text attributes and prompt complexity and covering spelling accuracy, contextual relevance, and visual coherence. Using a diverse set of crafted prompts and texts, together with a dedicated image dataset for probing Variational Autoencoder (VAE) models across different character representations, we show that VAE architectures themselves can be a significant source of text-rendering failures in diffusion frameworks. Contribution/Results: Extensive analysis of multiple models uncovers prevalent deficiencies—including spelling inaccuracies and contextual mismatches—and pinpoints failure points across prompts and texts, laying a foundation for targeted improvements in AI-generated multimodal content.
📝 Abstract
Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated by the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text- and prompt-complexity-driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.