DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing frontend UI code generation benchmarks suffer from three key limitations: narrow framework coverage (omitting mainstream frameworks such as React, Vue, and Angular), task monotony (focusing solely on generation while neglecting iterative tasks such as editing and repair), and coarse-grained evaluation (lacking difficulty modeling, context sensitivity, and fine-grained code-level analysis). To address these gaps, we introduce DesignBench, the first MLLM-oriented frontend code benchmark supporting multiple frameworks (React, Vue, Angular, and vanilla HTML) and multiple tasks (generation, editing, and repair), comprising 900 web pages, 11 semantic themes, 9 edit types, and 6 defect categories. The benchmark features a multi-level automated evaluation that assesses syntactic correctness, functional behavior, and visual fidelity, along with structured task annotation and controllable difficulty sampling. Extensive experiments systematically expose significant disparities among MLLMs in framework adaptability, task-specific bottlenecks (especially in editing and repair), and input sensitivity. All data and tools are open-sourced for full reproducibility.
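The multi-level evaluation described above can be illustrated with a toy harness. This is a minimal sketch, not DesignBench's actual implementation: the helper names (`syntax_ok`, `functional_score`, `visual_score`), the required-element check standing in for real behavioral tests, and the equal level weighting are all assumptions for illustration.

```python
# Hypothetical sketch of a multi-level UI-code evaluation:
# level 1 = syntactic correctness, level 2 = functional behavior,
# level 3 = visual fidelity. Illustrative only, not the paper's code.
from html.parser import HTMLParser

class _SyntaxChecker(HTMLParser):
    """Level 1: flag mismatched open/close tags as syntax errors."""
    VOID = {"br", "img", "input", "meta", "link", "hr"}

    def __init__(self):
        super().__init__()
        self.stack, self.ok = [], True

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if not self.stack or self.stack.pop() != tag:
            self.ok = False

def syntax_ok(html: str) -> bool:
    checker = _SyntaxChecker()
    checker.feed(html)
    return checker.ok and not checker.stack

def functional_score(html: str, required_elements: list) -> float:
    """Level 2: fraction of required UI elements present (a crude
    stand-in for behavioral checks run in a headless browser)."""
    if not required_elements:
        return 1.0
    hits = sum(1 for el in required_elements if f"<{el}" in html)
    return hits / len(required_elements)

def visual_score(pixels_a: list, pixels_b: list) -> float:
    """Level 3: toy pixel-level similarity between two rendered
    screenshots, flattened to grayscale intensity lists (0-255)."""
    diff = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))
    return 1.0 - diff / (255 * len(pixels_a))

def multi_level_score(html, required_elements, pixels, ref_pixels):
    if not syntax_ok(html):  # syntactically broken code scores zero
        return 0.0
    # Equal weighting of levels 2 and 3 is an assumption here.
    return (0.5 * functional_score(html, required_elements)
            + 0.5 * visual_score(pixels, ref_pixels))
```

A candidate page like `"<div><button>Go</button><input></div>"` passes the syntax gate and is then scored on element coverage and screenshot similarity; a page with mismatched tags scores zero regardless of how it renders.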

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development has become predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks. (2) Existing evaluations focus solely on the UI code generation task, whereas practical UI development involves several iterations, including refining, editing, and repairing issues. (3) Current benchmarks employ unidimensional evaluation, lacking investigation into influencing factors like task difficulty, input context variations, and in-depth code-level analysis. To bridge these gaps, we introduce DesignBench, a multi-framework, multi-task evaluation benchmark for assessing MLLMs' capabilities in automated front-end engineering. DesignBench encompasses three widely-used UI frameworks (React, Vue, and Angular) alongside vanilla HTML/CSS, and evaluates three essential front-end tasks (generation, editing, and repair) from real-world development workflows. DesignBench contains 900 webpage samples spanning 11 topics, 9 edit types, and 6 issue categories, enabling detailed analysis of MLLM performance across multiple dimensions. Our systematic evaluation reveals critical insights into MLLMs' framework-specific limitations, task-related bottlenecks, and performance variations under different conditions, providing guidance for future research in automated front-end development. Our code and data are available at https://github.com/WebPAI/DesignBench.
Problem

Research questions and friction points this paper is trying to address.

Lack of mainstream frameworks in current UI code benchmarks
Limited evaluation scope excluding iterative UI development tasks
Unidimensional assessment missing task difficulty and code analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-framework benchmark for MLLM front-end evaluation
Covers generation, editing, and repair tasks
Multi-level automated evaluation with detailed analysis across dimensions