Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high acquisition cost, privacy constraints, and domain-specific data scarcity associated with real images in vision-language model (VLM) training, this paper proposes the Text-Printed Image (TPI) paradigm: text descriptions are directly rendered as monochromatic textual images on a pure white background, enabling zero-cost construction of semantically faithful synthetic image-text pairs. TPI bypasses generative models entirely, relying solely on lightweight text rendering, and integrates seamlessly into existing VLM training pipelines. Coupled with large language models to generate diverse, high-quality captions, TPI facilitates purely text-driven vision-language pretraining and data augmentation. Comprehensive evaluation across four state-of-the-art VLMs and seven benchmarks demonstrates that TPI significantly outperforms diffusion-based, text-centric training approaches—achieving superior generalization, practicality, and scalability while preserving semantic fidelity and reducing computational and ethical overhead.

📝 Abstract
Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose the Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
Problem

Research questions and friction points this paper is trying to address.

Addresses the high cost of collecting image-text pairs for training large vision-language models.
Bridges the image-text modality gap when only textual data is available for training.
Seeks a low-cost way to generate semantically faithful synthetic images from text alone for effective model training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-Printed Image (TPI) renders textual descriptions directly onto a plain white canvas
TPI bridges the image-text modality gap at negligible cost
Rendered images preserve the semantics of the text without relying on diffusion models
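The core rendering step is simple enough to sketch. The snippet below is a minimal illustration of the TPI idea (not the authors' implementation): a caption is drawn as black text on a pure white canvas using Pillow, producing a synthetic "image" whose content is exactly the text itself. The function name, canvas size, and wrapping width are illustrative assumptions.

```python
# Hypothetical sketch of Text-Printed Image (TPI) rendering, not the paper's code.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_tpi(caption: str, size=(448, 448), margin: int = 16) -> Image.Image:
    """Render a caption as monochrome black text on a plain white canvas."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a TTF font for real use
    # Wrap long captions so they fit within the canvas width.
    wrapped = textwrap.fill(caption, width=40)
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return img

img = render_tpi("A red bicycle leaning against a brick wall.")
img.save("tpi_sample.png")
```

Because the pipeline is pure text rendering, it avoids generative models entirely and can be applied to any caption, including LLM-generated ones, at effectively zero cost per image.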