🤖 AI Summary
Existing diffusion transformer (DiT)-based text-to-image models offer only coarse-grained control over typography, layout, and style at the word level, which leads to inconsistent fonts and drifting styles. To address this, the paper proposes a two-stage DiT-based pipeline for word-level controllable text rendering. The method introduces enclosing typography control tokens (ETC-tokens) for fine-grained, word-level typographic control; typography control fine-tuning (TC-FT), a parameter-efficient scheme that updates only 5% of the model's parameters; and a text-agnostic style control adapter (SCA) that improves style consistency while preventing content leakage. The authors also build a word-level annotated controllable dataset, described as the first of its kind, by incorporating HTML-render into the data synthesis pipeline. Experiments show improved word-level typographic control, font consistency, and style consistency. The datasets and models will be made available for academic use.
📝 Abstract
Visual text rendering is widespread in real-world applications and requires careful font selection and typographic choices. Recent progress in diffusion transformer (DiT)-based text-to-image (T2I) models shows promise in automating these processes. However, these methods still face challenges such as inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline that addresses these problems by enhancing controllability over typography and style in text rendering. We introduce typography control fine-tuning (TC-FT), a parameter-efficient fine-tuning method (on $5\%$ of key parameters) with enclosing typography control tokens (ETC-tokens), which enables precise word-level application of typographic features. To further address style inconsistency in text rendering, we propose a text-agnostic style control adapter (SCA) that prevents content leakage while enhancing style consistency. To implement TC-FT and SCA effectively, we incorporate HTML-render into the data synthesis pipeline and construct the first word-level controllable dataset. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in text rendering tasks. The datasets and models will be available for academic use.
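To make the idea of enclosing control tokens and HTML-based synthesis concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a hypothetical token format in which a styled word is wrapped in an opening and a closing tag named after its font, and a hypothetical HTML layout with one `<span>` per word; the function names `etc_wrap` and `html_render_source`, the tag syntax, and the font names are all assumptions made for illustration, since the abstract does not specify them.

```python
from html import escape

# Each word is a (text, font) pair; font=None means "no typography control".
# This representation is an assumption, not the paper's annotation schema.

def etc_wrap(words):
    """Build a prompt in which each styled word is enclosed by a token pair."""
    parts = []
    for text, font in words:
        if font is None:
            parts.append(text)  # unstyled words are left untouched
        else:
            # Hypothetical enclosing tokens: <Font>word</Font>
            parts.append(f"<{font}>{text}</{font}>")
    return " ".join(parts)

def html_render_source(words, default_font="serif"):
    """Build HTML that a browser could render into a matching training image."""
    spans = []
    for text, font in words:
        family = escape(font or default_font, quote=True)
        spans.append(f'<span style="font-family: {family}">{escape(text)}</span>')
    return "<p>" + " ".join(spans) + "</p>"

words = [("Grand", "Lobster"), ("opening", None), ("sale", "Roboto")]
prompt = etc_wrap(words)
html = html_render_source(words)
print(prompt)
print(html)
```

Because the prompt and the HTML come from the same word list, each rendered image would carry word-level typography labels by construction, which is the property a word-level controllable dataset needs.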