Multi-modal user interface control detection using cross-attention

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low accuracy of pixel-only UI control detection under conditions of visual ambiguity, design diversity, and missing contextual information by proposing a multimodal extension of YOLOv5. The approach uniquely incorporates GPT-generated semantic descriptions of UI images, fusing them with visual features through a cross-attention mechanism to enable context-aware control detection. The study systematically evaluates three fusion strategies—element-wise addition, weighted summation, and convolution—on a dataset comprising over 16,000 images across 23 control types. Experimental results demonstrate that convolution-based fusion achieves the best performance, significantly outperforming baseline models, particularly in semantically complex or visually ambiguous edge cases.
📝 Abstract
Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies (element-wise addition, weighted sum, and convolutional fusion), demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
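The abstract's core mechanism (visual feature maps attending to text-token embeddings, followed by convolutional fusion) can be sketched as follows. This is a minimal illustrative PyTorch module, not the paper's actual implementation: the feature dimensions, the single-stage attention, and the 1x1 convolution for fusion are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: spatial locations of a visual feature map act as
    queries over text-token embeddings (e.g. from a GPT-generated description
    of the UI screenshot). Dimensions and layers are assumed, not taken from
    the paper."""

    def __init__(self, vis_dim=256, txt_dim=768, num_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # align text dim to visual dim
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # convolution-based fusion, the strategy the abstract reports as strongest
        self.fuse = nn.Conv2d(2 * vis_dim, vis_dim, kernel_size=1)

    def forward(self, vis_feat, txt_emb):
        # vis_feat: (B, C, H, W) from a YOLOv5-style backbone stage
        # txt_emb:  (B, T, txt_dim) token embeddings of the text description
        b, c, h, w = vis_feat.shape
        q = vis_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) spatial queries
        kv = self.txt_proj(txt_emb)               # (B, T, C) keys/values
        attended, _ = self.attn(q, kv, kv)        # (B, H*W, C) text-conditioned features
        attended = attended.transpose(1, 2).reshape(b, c, h, w)
        # concatenate visual and text-attended features, fuse with a 1x1 conv
        return self.fuse(torch.cat([vis_feat, attended], dim=1))

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 256, 20, 20), torch.randn(2, 16, 768))
print(out.shape)
```

The element-wise addition and weighted-sum strategies the paper compares would replace the `Conv2d` fusion with `vis_feat + attended` or `alpha * vis_feat + (1 - alpha) * attended` for a learnable `alpha`.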
Problem

Research questions and friction points this paper is trying to address.

UI control detection
multi-modal learning
visual ambiguity
contextual cues
software screenshots
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal learning
cross-attention
UI control detection
YOLOv5
text-vision fusion
Milad Moradi
AI Research Lab, Tricentis, Vienna, Austria
Ke Yan
AI Research Lab, Tricentis, Sydney, Australia
David Colwell
Senior Lecturer, School of Banking and Finance, the University of New South Wales
mathematical finance, continuous-time finance, derivatives pricing
Matthias Samwald
Institute of Artificial Intelligence, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria
Rhona Asgari
AI Research Lab, Tricentis, Vienna, Austria