🤖 AI Summary
Text-to-SVG generation faces two key challenges: poor generalization and weak instruction following. To address them, we propose SVGThinker, a reasoning-driven framework that aligns SVG code generation with the visualization process and builds the drawing stepwise, one primitive at a time, across the full set of SVG primitives. Training data is constructed by rendering each primitive in sequence and using a multimodal model to annotate the image and code. An LLM is then trained on this data with supervised fine-tuning that exposes its chain-of-thought as intermediate reasoning, which improves robustness and reduces errors and hallucinations. Experiments against state-of-the-art baselines show that our approach produces more stable, more editable, and higher-quality SVGs while preserving the structural advantages of vector graphics, which opens new directions for design, content creation, and automated graphics generation.
📝 Abstract
Scalable Vector Graphics (SVG) is a code-based representation for 2D visuals. Leveraging recent advances in large language models (LLMs), we study text-to-SVG generation and address two persistent gaps: weak generalization and poor adherence to input instructions. We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process and supports the full set of SVG primitives. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code; we then build stepwise updates that mirror the incremental addition of primitives. On this data, we train an LLM with supervised fine-tuning that exposes its chain-of-thought as intermediate reasoning, improving robustness and reducing errors and hallucinations. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs while preserving the structural advantages of vector graphics. Unlike image-based methods, our outputs enable precise and hierarchical editing, opening new directions for design, content creation, and automated graphics generation.
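The stepwise data construction described in the abstract (render each primitive in sequence, then build updates that mirror the incremental addition of primitives) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes that "primitives" are the top-level shape and text elements of an SVG document, it skips the rendering and multimodal annotation steps entirely, and the names `PRIMITIVE_TAGS`, `local_name`, and `incremental_snapshots` are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Elements treated as "primitives" in this sketch (an assumption, not the paper's list).
PRIMITIVE_TAGS = {"rect", "circle", "ellipse", "line", "polyline", "polygon", "path", "text"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def incremental_snapshots(svg_source: str) -> list[str]:
    """Return one SVG string per step; step k contains the first k top-level primitives."""
    # Serialize SVG elements without an 'ns0:' prefix.
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_source)
    # Only direct children are considered; primitives nested in <g> groups are ignored.
    primitives = [child for child in root if local_name(child.tag) in PRIMITIVE_TAGS]

    snapshots = []
    for step in range(1, len(primitives) + 1):
        # Fresh root with the same tag and attributes (width, height, viewBox, ...).
        canvas = ET.Element(root.tag, dict(root.attrib))
        for element in primitives[:step]:
            canvas.append(copy.deepcopy(element))
        snapshots.append(ET.tostring(canvas, encoding="unicode"))
    return snapshots


if __name__ == "__main__":
    demo = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
        '<rect width="10" height="10" fill="white"/>'
        '<circle cx="5" cy="5" r="2" fill="red"/>'
        "</svg>"
    )
    for index, snapshot in enumerate(incremental_snapshots(demo), start=1):
        print(f"step {index}: {snapshot}")
```

In a full pipeline, each snapshot would presumably be rasterized and paired with a description of the newly added primitive before being used as a training example. That part depends on the renderer and annotation model chosen and is left out here.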