IGD: Instructional Graphic Design with Multimodal Layer Generation

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automated graphic design methods face two key bottlenecks: traditional two-stage pipelines lack intelligence and creativity, while diffusion-based approaches generate only non-editable pixel-level images with blurry text rendering and limited practicality. This paper proposes the first natural language–driven, editable multimodal layer generation framework, integrating multimodal large language models (MLLMs) and diffusion models in an end-to-end jointly trained architecture. It introduces a paradigm of parameterized rendering coupled with image asset co-generation: the MLLM parses user instructions to predict layer attributes and layout structure, while the diffusion model synthesizes high-fidelity visual content. Experiments across diverse design scenarios demonstrate significant improvements over state-of-the-art methods, enabling efficient generation of semantically aligned, fully editable vector and layered design files, effectively bridging creative flexibility and engineering practicality.
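The summary above describes a two-component pipeline: an MLLM that plans editable layers from the instruction, and a diffusion model that fills in pixel content only for image-asset layers. Below is a minimal Python sketch of that flow; the `Layer` schema and the `plan_layers` / `generate` interfaces are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    """One editable layer in a generated design (hypothetical schema)."""
    kind: str                                        # e.g. "text", "image", "shape"
    bbox: tuple                                      # (x, y, width, height) on the canvas
    attributes: dict = field(default_factory=dict)   # font, color, prompt, ...
    asset: object = None                             # raster content for image layers

def generate_design(instruction: str, mllm, diffusion) -> list[Layer]:
    """Sketch of the summarized pipeline (interfaces assumed for illustration)."""
    # 1) The MLLM parses the instruction into an ordered layer plan:
    #    layer types, attributes, and layout. `plan_layers` is hypothetical.
    layers = [Layer(**spec) for spec in mllm.plan_layers(instruction)]

    # 2) The diffusion model synthesizes content only for image-asset layers;
    #    text and shape layers stay parametric, hence editable and sharp.
    for layer in layers:
        if layer.kind == "image":
            prompt = layer.attributes.get("prompt", instruction)
            layer.asset = diffusion.generate(prompt, size=layer.bbox[2:])
    return layers
```

Keeping text and shape layers parametric is what preserves editability and legible text, since only image layers receive raster output from the diffusion model.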

📝 Abstract
Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at the image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate multimodal layers with editable flexibility using only natural language instructions. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of MLLM to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. The superior experimental results demonstrate that IGD offers a new solution for graphic design.
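To make the "standardized format for multi-scenario design files" mentioned in the abstract concrete, here is an illustrative layered design file. The abstract does not specify IGD's actual schema, so every field name and value below is an assumption for illustration only.

```python
import json

# Hypothetical layered design file in a standardized, editable format.
# Layers are listed bottom-to-top, so list order encodes the layer sequence.
design_file = {
    "canvas": {"width": 1080, "height": 1350, "background": "#FFFFFF"},
    "layers": [
        {"type": "image", "bbox": [0, 0, 1080, 810],
         "asset": "assets/hero.png",
         "attributes": {"prompt": "minimalist coffee shop interior"}},
        {"type": "shape", "bbox": [80, 860, 920, 360],
         "attributes": {"shape": "rounded_rect", "fill": "#1E1E1E", "radius": 24}},
        {"type": "text", "bbox": [120, 900, 840, 120],
         "attributes": {"content": "Grand Opening", "font": "Inter-Bold",
                        "size": 72, "color": "#FFFFFF", "align": "center"}},
    ],
}

# Each layer stays individually editable: changing the headline text or the
# shape fill only touches its attributes, without regenerating pixels.
print(json.dumps(design_file, indent=2))
```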
Problem

Research questions and friction points this paper is trying to address.

Automates graphic design with editable layers via natural language instructions
Improves text legibility and creativity in diffusion-based design generation
Enables scalable, extensible end-to-end training for complex design tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates editable multimodal layers via natural language
Uses parametric rendering and image asset generation (see the sketch after this list)
Leverages MLLM for layer prediction and layout
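As an assumed illustration of the parametric-rendering point above: a text layer is drawn from its editable parameters at render time rather than generated as pixels, which is why text stays sharp and editable. The sketch below uses Pillow; the field names, font file path, and sizes are hypothetical, not taken from the paper.

```python
from PIL import Image, ImageDraw, ImageFont  # assumes Pillow is installed

def render_text_layer(canvas: Image.Image, layer: dict) -> None:
    """Parametrically render a text layer from its editable attributes."""
    attrs = layer["attributes"]
    x, y, _, _ = layer["bbox"]
    try:
        # Font file path is a hypothetical default for this sketch.
        font = ImageFont.truetype(attrs.get("font_file", "Inter-Bold.ttf"), attrs["size"])
    except OSError:
        font = ImageFont.load_default()  # fall back if the font file is missing
    ImageDraw.Draw(canvas).text((x, y), attrs["content"], font=font, fill=attrs["color"])

canvas = Image.new("RGB", (1080, 1350), "white")
render_text_layer(canvas, {"bbox": [120, 900, 840, 120],
                           "attributes": {"content": "Grand Opening",
                                          "size": 72, "color": "#1E1E1E"}})
canvas.save("poster_preview.png")  # the layer parameters remain editable afterwards
```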