Benchmarking Multimodal LLMs on Code Generation for Complex Interactive Webpages

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for web code generation focus primarily on static pages and do not evaluate complex interactive behaviors or the interaction consistency between generated and reference pages. To address this gap, this work proposes WebIGBench, the first benchmark designed specifically for interactive web code generation. It covers 103 complex webpages collected from real-world websites, five common interaction types, and 871 distinct user actions. The benchmark combines human-designed interaction paths with UI automation and provides an evaluation framework with structured metrics for multimodal large models. It also adds interaction consistency as an evaluation dimension, which earlier benchmarks limited to visual fidelity and code structure did not measure. Systematic assessments using WebIGBench reveal the performance limits of current models on complex interactive tasks.

📝 Abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in multimodal reasoning and code generation, catalyzing a new paradigm for front-end development. In particular, these models can directly transform visual designs into executable code, significantly improving the efficiency and adaptability of web development. Modern web applications are dynamic and interactive, featuring frequent user-page interactions. However, existing benchmarks largely evaluate code generation for static webpages, ignoring the complex interactive behaviors of real-world applications. In addition, their evaluation criteria remain confined to visual fidelity and code structure, overlooking the interaction consistency between the generated and the reference webpages. To address these limitations, we introduce WebIGBench, the first benchmark designed to evaluate code generation for interactive webpages with complex interactions. By combining manually designed interaction paths with UI automation, we collected 103 complex webpages from real-world websites. The benchmark covers 5 popular interactive action types (e.g., click, input) involving 871 distinct interactive actions. Moreover, we propose a novel evaluation pipeline to address the gap in automated assessment of interactive actions. Extensive experiments with WebIGBench on several representative MLLMs reveal the performance boundaries of current models in interactive webpage code generation. The proposed benchmark is available at https://github.com/anoa12159-hue/WebIGBench_eval.
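The abstract does not say how interaction consistency is computed, so the following is only a rough illustration of the idea, not the WebIGBench pipeline. It assumes a page can be modelled as a state-transition function, replays the same interaction path on a reference page and a generated page, and reports the fraction of steps after which both are in the same observable state. Every name here (`Action`, `replay`, `interaction_consistency`, the toy pages and selectors) is hypothetical. A real setup would drive a browser through a UI automation tool instead of calling Python functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical action record; field names are illustrative, not taken from WebIGBench.
@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "click" or "input", the two types the abstract names
    target: str      # selector-like identifier of the element acted on
    value: str = ""  # text typed for "input" actions

State = Dict[str, str]
# Assumption: a page is modelled as a function (state, action) -> new state.
Page = Callable[[State, Action], State]

def replay(page: Page, path: List[Action]) -> List[State]:
    """Apply each action in order and record the page state after every step."""
    state: State = {}
    trace: List[State] = []
    for action in path:
        state = page(dict(state), action)  # copy so earlier snapshots stay intact
        trace.append(state)
    return trace

def interaction_consistency(reference: Page, generated: Page, path: List[Action]) -> float:
    """Fraction of steps after which both pages are in the same observable state."""
    if not path:
        return 1.0
    ref_trace = replay(reference, path)
    gen_trace = replay(generated, path)
    matches = sum(r == g for r, g in zip(ref_trace, gen_trace))
    return matches / len(path)

def reference_page(state: State, action: Action) -> State:
    """Toy reference page: a search box plus a menu that toggles on click."""
    if action.kind == "click" and action.target == "#menu":
        state["menu"] = "closed" if state.get("menu") == "open" else "open"
    elif action.kind == "input" and action.target == "#search":
        state["search"] = action.value
    return state

def generated_page(state: State, action: Action) -> State:
    """Toy generated page: handles the search box but ignores menu clicks."""
    if action.kind == "input" and action.target == "#search":
        state["search"] = action.value
    return state

path = [
    Action("input", "#search", "shoes"),
    Action("click", "#menu"),
    Action("click", "#menu"),
]
score = interaction_consistency(reference_page, generated_page, path)
print(f"interaction consistency: {score:.2f}")
```

In this toy case only the first step matches, because the generated page never reacts to the menu clicks. Comparing full state snapshots is a deliberately strict choice: one missed interaction keeps later steps inconsistent until the states converge again. A per-action comparison would be more lenient.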

Problem

Research questions and friction points this paper is trying to address.

multimodal LLMs

code generation

interactive webpages

benchmarking

interaction consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive webpage

code generation

multimodal LLM

benchmark

UI automation

🔎 Similar Papers

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

2024-03-05 · Citations: 0

Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

2024-06-24 · arXiv.org · Citations: 10