🤖 AI Summary
Existing approaches to front-end code generation have not systematically investigated how to integrate textual and visual feedback (such as sketches or screenshots) within multi-turn dialogues, which hinders accurate understanding of user intent and functional consistency. This work introduces FronTalk, a benchmark of 100 multi-turn dialogues in which each turn carries paired textual and visual instructions, along with an evaluation framework based on an autonomous web agent that jointly assesses functional correctness and user experience. Incorporating multimodal feedback into this task reveals two critical challenges: severe forgetting of previously implemented features and insufficient visual comprehension. The proposed AceCoder method mitigates forgetting by using a web agent to critique the implementation of every past instruction, reducing forgetting to nearly zero and improving performance by up to 9.3 percentage points (from 56.0% to 65.3%), thereby establishing a foundation for multi-turn, multimodal code generation.
📝 Abstract
We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn pairs a textual instruction with an equivalent visual instruction that expresses the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework in which a web agent simulates users and explores the website, measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that have not been systematically explored in the literature: (1) a significant forgetting issue, where models overwrite previously implemented features and cause task failures, and (2) a persistent difficulty in interpreting visual feedback, especially for open-source vision-language models (VLMs). To tackle the forgetting issue, we propose a strong baseline, AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk
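The idea of re-checking every past instruction after each turn can be illustrated with a toy sketch. This is not the AceCoder implementation: the abstract does not say how critiques are fed back to the model, so the repair loop, the `generate` and `check` callables, the `max_repairs` parameter, and the representation of a website as a set of feature names are all assumptions made for illustration. The `check` callable stands in for the web agent.

```python
from typing import Callable, List, Set

# Every name here is a hypothetical stand-in, not the FronTalk or AceCoder API.
Code = Set[str]  # toy "website": the set of features it currently implements
Generate = Callable[[Code, str, List[str]], Code]
Check = Callable[[Code, str], bool]


def forgetting_rate(passed_before: Set[str], passed_now: Set[str]) -> float:
    """Fraction of previously passing instructions that no longer pass."""
    if not passed_before:
        return 0.0
    return len(passed_before - passed_now) / len(passed_before)


def run_dialogue(instructions: List[str], generate: Generate, check: Check,
                 max_repairs: int = 2) -> Code:
    """After each turn, re-check all instructions seen so far and ask the
    generator to repair the ones that fail (assumed feedback scheme)."""
    code: Code = set()
    for turn, instruction in enumerate(instructions):
        code = generate(code, instruction, [])
        seen = instructions[: turn + 1]
        for _ in range(max_repairs):
            failed = [i for i in seen if not check(code, i)]
            if not failed:
                break
            code = generate(code, instruction, failed)
    return code


def forgetful_generate(code: Code, instruction: str, critique: List[str]) -> Code:
    """Toy model: without critique it overwrites all earlier features."""
    if not critique:
        return {instruction}
    return set(code) | {instruction} | set(critique)


def has_feature(code: Code, instruction: str) -> bool:
    """Toy checker standing in for a web agent exploring the site."""
    return instruction in code


turns = ["navbar", "search", "dark mode"]
with_critique = run_dialogue(turns, forgetful_generate, has_feature)
without_critique = run_dialogue(turns, forgetful_generate, has_feature, max_repairs=0)
```

In this toy setup the run without repairs keeps only the last feature, so `forgetting_rate({"navbar", "search"}, without_critique)` is 1.0, while the run with critique retains all three features. A real system would replace the set-based checker with agent-driven interaction on rendered pages.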