Scaling Down Text Encoders of Text-to-Image Diffusion Models

📅 2025-03-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Text-to-image diffusion models commonly employ large text encoders (e.g., T5-XXL) to enhance prompt understanding and text fidelity, yet these encoders suffer from parameter redundancy, high computational cost, and weak responsiveness to non-visual prompts. Method: We propose the first vision-oriented knowledge distillation paradigm for text encoders: (i) constructing a multi-objective distillation dataset that balances image quality, semantic fidelity, and text-rendering capability; and (ii) designing a vision-guided distillation strategy. Using T5 scaling and fine-tuning, we compress T5-XXL into T5-base, reducing parameters by 50×. Contribution/Results: Our distilled encoder maintains generation quality comparable to the full T5-XXL on FLUX and SD3 while significantly reducing GPU memory footprint and inference latency. This work provides the first systematic empirical validation that large text encoders are compressible in visual generation tasks, establishing a new paradigm for efficient multimodal modeling.

๐Ÿ“ Abstract
Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it has also led to a substantial increase in the number of parameters. Despite T5-series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with a T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. This raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit the teacher's capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text rendering. Our results demonstrate a scaling-down pattern: the distilled T5-base model can generate images of quality comparable to those produced by T5-XXL, while being 50 times smaller. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.
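The abstract does not spell out the distillation objective. A common setup for this kind of encoder distillation, and a minimal sketch of what it could look like, is to project the student's token embeddings into the teacher's feature space and minimize an MSE loss between them; the hidden sizes below are the standard T5-base (768) and T5-XXL (4096) widths, but the linear projection and the MSE loss are assumptions, not the paper's confirmed recipe:

```python
import torch
import torch.nn as nn

D_STUDENT, D_TEACHER = 768, 4096  # standard hidden sizes of T5-base / T5-XXL

class DistillHead(nn.Module):
    """Feature-level distillation head (hypothetical sketch): maps student
    token embeddings to the teacher's width and scores the mismatch."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_STUDENT, D_TEACHER)

    def forward(self, student_h, teacher_h):
        # student_h: (batch, seq, 768); teacher_h: (batch, seq, 4096)
        return nn.functional.mse_loss(self.proj(student_h), teacher_h)

# Random tensors stand in for the two encoders' outputs on the same prompt.
head = DistillHead()
student_h = torch.randn(2, 16, D_STUDENT)
teacher_h = torch.randn(2, 16, D_TEACHER)
loss = head(student_h, teacher_h)
loss.backward()  # in real training, gradients also flow into the student encoder
```

In practice the teacher's outputs would be precomputed and frozen, and only the student and projection would be updated; the paper's vision-guided strategy additionally ties this supervision to image-generation objectives.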
Problem

Research questions and friction points this paper is trying to address.

Reducing text encoder size in diffusion models
Eliminating redundancy in non-visual prompt handling
Maintaining image quality with smaller encoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-based knowledge distillation for T5 encoder
Dataset based on image quality and semantics
50x smaller model with comparable performance
Lifu Wang
JD Explore Academy, JD.com Inc., Georgia Institute of Technology
Daqing Liu
JD Explore Academy, JD.com Inc.
Xinchen Liu
JD Explore Academy
Xiaodong He
JD Explore Academy, JD.com Inc.