🤖 AI Summary
SVG understanding and generation face dual challenges: precise vector code modeling and effective fusion of multimodal conditions (text/image). To address these, we propose UniSVG—the first unified SVG dataset tailored for multimodal large language models (MLLMs), comprising 525K high-quality samples that enable bidirectional cross-modal translation: text→SVG, image→SVG, and SVG→semantics. Methodologically, we introduce floating-point parameterized path modeling, cross-modal alignment training, and conditional generation mechanisms to significantly improve structural fidelity in vector representation. Evaluated on multiple SVG understanding and generation benchmarks, MLLMs trained on UniSVG outperform closed-source models including GPT-4V, and training on the dataset substantially boosts the performance of open-source MLLMs. We fully open-source the dataset, model weights, training/inference code, and evaluation benchmarks—establishing a standardized, reproducible foundation to advance open research in AI for vector graphics.
📝 Abstract
Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and, represented as SVG code, are frequently employed in computer vision and artistic design. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, the rapid growth of Multi-modal Large Language Models (MLLMs) has demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, model weights, code, and experiment details at https://ryanlijinke.github.io/.
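To make the "floating-point parameters" point concrete, here is a minimal illustrative sketch (not taken from the paper or its dataset) of what a model must emit: an SVG document whose single cubic Bézier path is fully determined by a handful of floats, so even small numeric errors change the rendered shape.

```python
# Illustrative only: a tiny SVG with one cubic Bezier curve. The six floats
# after "C" are the control-point coordinates that an SVG-generating model
# must predict precisely.
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<path d="M 10.5 80.0 C 40.2 10.3, 65.8 10.3, 95.1 80.0" '
    'fill="none" stroke="black" stroke-width="2.5"/>'
    '</svg>'
)

# Extract the floating-point parameters of the curve segment to show that
# the geometry is just a short list of numbers.
curve_args = "40.2 10.3, 65.8 10.3, 95.1 80.0"
params = [float(tok.rstrip(',')) for tok in curve_args.split()]
print(params)  # -> [40.2, 10.3, 65.8, 10.3, 95.1, 80.0]
```

Perturbing any one of these values (e.g. moving a control point's y-coordinate from 10.3 to 30.3) flattens or skews the curve, which is why SVG generation places unusually strict precision demands on language models.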