🤖 AI Summary
Existing evaluation methods rely on local evidence, so they cannot verify whether webpages generated by multimodal large language models (MLLMs) reach the implicit states and exhibit the interactive behaviors that user tasks require. The paper introduces WebRISE, a benchmark that compiles task requirements into Interaction Contract Graphs (ICGs): observable states, user-intent transitions, and DOM/visual assertions that enable implementation-agnostic, browser-level behavioral validation. This requirement-driven state-transition modeling separates user-stated functions from implicit product-level constraints and supports fine-grained evaluation across five input modalities (Text, Markdown, Sketch, Image, Video). Across 442 tasks and 14 MLLMs, even the strongest model achieves only 65.6% transition validity and 66.3% requirement coverage; in defect-injection experiments, ICG-based scoring detects state errors at 2–16 times the rate of checkpoint-style evaluation.
📝 Abstract
Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
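To make the idea of an interaction contract concrete, the following is a minimal sketch, not WebRISE's actual implementation or API. It assumes a toy to-do page modeled as a plain dictionary rather than a real browser, and it assumes that transition validity can be approximated as the fraction of contract transitions whose assertions all pass. The names `Page`, `Transition`, `score`, `new_page`, and `add` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for observable DOM state (a real harness would drive a browser).
Page = Dict[str, list]


@dataclass
class Transition:
    name: str
    actions: List[Callable[[Page], None]]     # user-intent steps, replayed on a fresh page
    assertions: List[Callable[[Page], bool]]  # checks on the resulting observable state
    implicit: bool = False                    # True for constraints the user never stated


def rate(passed: int, total: int) -> float:
    """Pass fraction, defined as 0.0 when there is nothing to check."""
    return passed / total if total else 0.0


def score(transitions: List[Transition], new_page: Callable[[], Page]) -> Dict[str, float]:
    """Assumed scoring: a transition is valid only if all of its assertions hold."""
    passed = {True: 0, False: 0}  # keyed by the `implicit` flag
    total = {True: 0, False: 0}
    for t in transitions:
        page = new_page()
        for act in t.actions:
            act(page)
        ok = all(check(page) for check in t.assertions)
        total[t.implicit] += 1
        passed[t.implicit] += int(ok)
    return {
        "transition_validity": rate(passed[True] + passed[False], total[True] + total[False]),
        "explicit_pass_rate": rate(passed[False], total[False]),
        "implicit_pass_rate": rate(passed[True], total[True]),
    }


def new_page() -> Page:
    return {"items": []}


def add(text: str) -> Callable[[Page], None]:
    # Toy "generated page" behavior: appends the item without validating the input.
    def act(page: Page) -> None:
        page["items"].append(text)
    return act


contract = [
    # Explicit requirement: adding an item shows it in the list.
    Transition("add item", [add("milk")], [lambda p: p["items"] == ["milk"]]),
    # Implicit constraint: an empty item should be rejected.
    Transition("reject empty item", [add("")], [lambda p: p["items"] == []], implicit=True),
]

result = score(contract, new_page)
print(result)
```

In this toy run the page satisfies the user-stated function but violates the unstated constraint, which is the kind of gap between explicit and implicit requirements that the abstract describes. Reporting the two pass rates separately is a design choice of this sketch; the benchmark's exact metric definitions may differ.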