🤖 AI Summary
This work addresses a longstanding challenge in diffusion-based text generation: achieving high typographic fidelity and legibility at the same time. The authors propose a plug-and-play conditional control framework for Diffusion Transformer (DiT) architectures that combines hierarchical text-font token representations, spatially anchored position-aware embeddings, and a multi-level token dropout strategy. A dual encoder that fuses DeepFont and DINOv2 enriches the font representation, and the framework does not require retraining the base DiT model. Experiments show a 76% relative improvement over single-encoder baselines on decorative fonts and 68–76% gains in font consistency over unconditioned models, balancing visual fidelity and readability while integrating into existing DiT pipelines.
📝 Abstract
Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates a 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains of approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining the base model.
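The abstract does not spell out how the multi-level token dropping strategy works, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes two levels: a group level that occasionally removes all font tokens, and a token level that thins the remaining font tokens independently. The function name `multi_level_token_drop`, the parameters `p_group` and `p_token`, and the choice to always keep character tokens are hypothetical.

```python
import random


def multi_level_token_drop(char_tokens, font_tokens,
                           p_group=0.1, p_token=0.2, rng=None):
    """Hypothetical two-level dropout over conditioning tokens.

    Level 1 (group): with probability p_group, drop every font token,
        so the sample carries no font conditioning at all.
    Level 2 (token): otherwise, drop each font token independently
        with probability p_token.
    Character tokens are always kept so the text content stays intact
    (an assumption made for this sketch).
    """
    rng = rng or random.Random()
    if rng.random() < p_group:
        kept_font = []  # whole font group removed
    else:
        kept_font = [t for t in font_tokens if rng.random() >= p_token]
    return list(char_tokens) + kept_font


if __name__ == "__main__":
    chars = ["c:H", "c:i"]
    fonts = ["f:global", "f:glyph0", "f:glyph1"]
    # A seeded generator makes the random choices reproducible.
    print(multi_level_token_drop(chars, fonts, rng=random.Random(0)))
```

Under this reading, shorter conditioning sequences reduce the number of tokens the model processes, and randomly withholding font tokens during training may discourage over-reliance on any single font cue, which is in line with the efficiency and generalization benefits the abstract claims.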