Structuring GUI Elements through Vision Language Models: Towards Action Space Generation

📅 2025-08-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) struggle to generate accurate UI element coordinates for GUI structuralization, primarily because numerical coordinates have weak semantic representation in the language space and the vision encoder generalizes insufficiently. Method: the paper proposes the IoU-Augmented Maximum Likelihood (IAML) training framework, which introduces an Intersection-over-Union (IoU)-based coordinate sampling strategy to mitigate exposure bias and employs IoU-enhanced supervision to refine the coordinate regression objective. The MLLM is fine-tuned on annotation data augmented with this geometrically informed signal. Contribution/Results: IAML significantly improves UI element localization accuracy and action-space prediction fidelity. Experiments across multiple GUI structuralization benchmarks demonstrate consistent superiority over standard maximum likelihood training, with average coordinate error reduced by 23.6%. This establishes a more robust vision–language alignment paradigm for instruction-driven interface understanding.
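
The paper does not ship code, so the following is only a minimal Python sketch of what an IoU-based coordinate sampling pipeline could look like: jitter the ground-truth box and keep variants whose IoU with the ground truth stays above a threshold, recording each variant's IoU as a soft label. The function names, the Gaussian jitter scheme, and the 0.5 IoU cutoff are illustrative assumptions, not details from the paper.

```python
import random

def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_coordinate_variants(gt_box, n_samples=8, jitter=0.05, min_iou=0.5):
    """Sample jittered boxes near the ground truth; keep those above
    an IoU threshold, paired with their IoU as a soft supervision signal."""
    x1, y1, x2, y2 = gt_box
    w, h = x2 - x1, y2 - y1
    variants, attempts = [], 0
    while len(variants) < n_samples and attempts < 100 * n_samples:
        attempts += 1
        cand = (x1 + random.gauss(0, jitter * w),
                y1 + random.gauss(0, jitter * h),
                x2 + random.gauss(0, jitter * w),
                y2 + random.gauss(0, jitter * h))
        score = iou(cand, gt_box)
        if score >= min_iou:
            variants.append((cand, score))
    return variants

# Example: augment one UI-element annotation with 3 near-miss variants.
print(sample_coordinate_variants((100, 40, 260, 90), n_samples=3))
```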

📝 Abstract
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper, we focus on the application of MLLMs to graphical user interface (GUI) element structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster the capabilities of the visual module. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling that augments the training data while accounting for proximity to the ground-truth coordinates. This augmented data is then used to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
Problem

Research questions and friction points this paper is trying to address.

Improving UI coordinate generation precision in MLLMs
Addressing semantic void in numerical coordinate representation
Mitigating exposure bias in traditional likelihood estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

IoU-Augmented Maximum Likelihood (IAML) training paradigm (see the sketch after this list)
IoU-based coordinate sampling pipeline
Fine-tunes MLLMs to mitigate exposure bias
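
One plausible reading of how the IoU signal enters the objective, shown as a hedged PyTorch sketch: the token-level negative log-likelihood of each sampled coordinate sequence is scaled by that sample's IoU with the ground-truth box. The exact loss in the paper may differ; `iaml_loss` and its tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def iaml_loss(logits, target_ids, iou_weights):
    """IoU-weighted maximum likelihood (sketch, not the paper's exact loss).

    logits:      (batch, seq_len, vocab) model outputs over coordinate tokens
    target_ids:  (batch, seq_len) tokenized coordinate sequences, including
                 IoU-sampled near-ground-truth variants
    iou_weights: (batch,) IoU of each sampled box with its ground-truth box
    """
    # Per-sequence negative log-likelihood (cross_entropy expects class dim 1).
    nll = F.cross_entropy(
        logits.transpose(1, 2), target_ids, reduction="none"
    ).mean(dim=1)  # (batch,)
    # Scale each sequence's loss by its IoU so near-miss coordinates receive
    # down-weighted supervision rather than zero credit.
    return (iou_weights * nll).mean()
```

Weighting by IoU gives geometrically close predictions partial credit, which is one way to soften the exposure bias of training exclusively on exact ground-truth token sequences.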
Authors
Yi Xu (Shanghai Jiao Tong University)
Yesheng Zhang (Shanghai Jiao Tong University)
Jiajia Liu (Ant Group)
Jingdong Chen (Ant Group)