FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a persistent trade-off in diffusion-based text generation: precise font control tends to degrade legibility, while preserving legibility tends to cost typographic fidelity. The authors propose a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that combines hierarchical text-font token representations, position-aware embeddings that spatially bind typography to image content, and a multi-level token-dropping strategy. Font representations come from a dual encoder that combines DeepFont and DINOv2, which the authors report outperforms any single encoder. The framework integrates into existing DiT pipelines without retraining. Reported results include a 76% relative improvement over single-encoder baselines on decorative fonts and font consistency gains of roughly 68–76% over unconditioned models.
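As an illustration of what a dual-encoder font representation could look like, here is a minimal sketch that L2-normalizes two embedding vectors and concatenates them. The summary does not say how FontFusion actually fuses the DeepFont and DINOv2 outputs, so concatenation is an assumption, and the function names (`l2_normalize`, `fuse_font_embeddings`) and the toy vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_font_embeddings(emb_a, emb_b):
    """Concatenate two L2-normalized embeddings (assumed fusion rule)."""
    return l2_normalize(emb_a) + l2_normalize(emb_b)


# Toy stand-ins for the outputs of a font-specific encoder and a
# general-purpose vision encoder; real embeddings would be much larger.
deepfont_like = [3.0, 4.0]
dinov2_like = [0.0, 2.0, 0.0]

fused = fuse_font_embeddings(deepfont_like, dinov2_like)
print(len(fused))
```

Normalizing each embedding before concatenation keeps either encoder from dominating the fused vector purely because of its scale, which is one common reason to prefer it over raw concatenation.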

📝 Abstract

Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation that establishes explicit text-font relationships at multiple granularities, (2) position-aware embeddings that create spatial bindings between typography and image content, and (3) a multi-level token-dropping strategy that improves both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates a 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains of approximately 68–76% over unconditioned models, while integrating into existing DiT architectures without retraining.
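To make the idea of multi-level token dropping concrete, the following sketch drops conditioning tokens at two levels: whole groups (for example, all tokens tied to one text region) and individual tokens inside the surviving groups. The abstract does not specify the levels, probabilities, or schedule FontFusion uses, so the two-level scheme, the function name `multi_level_drop`, and the parameters `p_group` and `p_token` are all assumptions made for illustration.

```python
import random


def multi_level_drop(groups, p_group, p_token, rng):
    """Drop conditioning tokens at two levels (hypothetical scheme).

    groups  : list of token lists, one list per group (e.g. per text region)
    p_group : probability of dropping an entire group
    p_token : probability of dropping each token in a surviving group
    rng     : random.Random instance, passed in for reproducibility
    """
    kept = []
    for group in groups:
        # Coarse level: discard the whole group.
        if rng.random() < p_group:
            continue
        # Fine level: discard individual tokens within the group.
        survivors = [tok for tok in group if rng.random() >= p_token]
        kept.append(survivors)
    return kept


# Toy conditioning tokens for two text regions.
groups = [["font_a", "H", "i"], ["font_b", "O", "K"]]
rng = random.Random(0)

keep_all = multi_level_drop(groups, p_group=0.0, p_token=0.0, rng=rng)
drop_all = multi_level_drop(groups, p_group=1.0, p_token=0.0, rng=rng)
sampled = multi_level_drop(groups, p_group=0.2, p_token=0.3, rng=rng)
print(len(keep_all), len(drop_all), len(sampled))
```

Dropping tokens during training shortens the conditioning sequence, which saves compute, and forces the model to cope with partial conditioning, which is the usual argument for why such schemes help generalization.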

Problem

Research questions and friction points this paper is trying to address.

typography generation

diffusion models

font control

text legibility

typographic fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

typographic conditioning

diffusion models

hierarchical token representation