Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

📅 2024-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates multimodal large language models (MLLMs) on the Design2Code task: end-to-end generation of renderable front-end code from real-world webpage screenshots. To this end, we introduce the first real-web Design2Code benchmark, comprising 484 manually curated examples, together with the first reproducible evaluation framework for multimodal front-end code generation. The framework integrates automated metrics that assess layout, element, and styling consistency with complementary human evaluation. Methodologically, it combines pixel-level and DOM-structure alignment, automated rendering validation, and systematic multi-model assessment (GPT-4o, GPT-4V, Gemini, and Claude). Experimental results reveal fundamental bottlenecks in current MLLMs: poor visual element recall and inaccurate layout generation. We provide a fine-grained performance decomposition across modalities and code components, offering an empirical foundation both for model improvement and for the development of standardized evaluation protocols.
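The render-and-compare step of such a pipeline can be sketched as follows: the generated code is rendered to a screenshot, which is then scored against the reference at the pixel level. The function name and the simple mean-absolute-difference score below are illustrative assumptions, not the benchmark's exact metrics:

```python
import numpy as np

def pixel_similarity(ref: np.ndarray, gen: np.ndarray) -> float:
    """Crude pixel-level agreement score in [0, 1].

    Both inputs are H x W x 3 uint8 screenshots; if the rendered page
    differs in size, it is cropped/padded to the reference shape.
    """
    h, w, _ = ref.shape
    canvas = np.full_like(ref, 255)            # pad with a white background
    gh, gw = min(h, gen.shape[0]), min(w, gen.shape[1])
    canvas[:gh, :gw] = gen[:gh, :gw]
    # Mean absolute channel difference, normalized so identical pages score 1.0
    mad = np.abs(ref.astype(np.int16) - canvas.astype(np.int16)).mean()
    return 1.0 - mad / 255.0

ref = np.zeros((4, 4, 3), dtype=np.uint8)
print(pixel_similarity(ref, ref.copy()))       # identical screenshots -> 1.0
```

In practice a headless browser (e.g. Playwright or Selenium) would produce the screenshots, and the paper's framework additionally aligns matched elements rather than comparing raw pixels alone.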

📝 Abstract
Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal large language models (MLLMs) directly convert visual designs into code implementations. In this work, we construct Design2Code - the first real-world benchmark for this task. Specifically, we manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations to validate the performance ranking. To rigorously benchmark MLLMs, we test various multimodal prompting methods on frontier models such as GPT-4o, GPT-4V, Gemini, and Claude. Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
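The "recalling visual elements" failure mode above can be operationalized as a recall score over the page's text blocks: what fraction of the reference page's visible text reappears in the generated page? This sketch uses fuzzy string matching via Python's `difflib`; the greedy one-to-one matching and the 0.8 threshold are illustrative assumptions, not the benchmark's exact element-recall metric:

```python
from difflib import SequenceMatcher

def text_element_recall(ref_blocks, gen_blocks, threshold=0.8):
    """Fraction of reference text blocks recalled in the generated page.

    A reference block counts as recalled if some generated block is at
    least `threshold` similar to it; each generated block matches once.
    """
    remaining = list(gen_blocks)
    recalled = 0
    for ref in ref_blocks:
        best_i, best_score = -1, 0.0
        for i, gen in enumerate(remaining):
            score = SequenceMatcher(None, ref.lower(), gen.lower()).ratio()
            if score > best_score:
                best_i, best_score = i, score
        if best_i >= 0 and best_score >= threshold:
            recalled += 1
            remaining.pop(best_i)   # consume the matched generated block
    return recalled / len(ref_blocks) if ref_blocks else 1.0

ref = ["Sign up", "Contact us", "Latest news"]
gen = ["Sign up", "Contact Us"]
print(text_element_recall(ref, gen))  # 2 of 3 blocks recalled -> 0.666...
```

A precision counterpart (fraction of generated blocks that appear in the reference) would penalize hallucinated elements the same way.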
Problem

Research questions and friction points this paper is trying to address.

Multimodal code generation for front-end engineering
Benchmarking AI models for webpage design conversion
Evaluating visual element recall and layout accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs convert designs to code
Design2Code benchmarks real-world webpages
Automatic and human evaluations assess performance