🤖 AI Summary
This work addresses the low accuracy of pixel-only UI control detection under visual ambiguity, design diversity, and missing contextual information by proposing a multimodal extension of YOLOv5. The approach incorporates GPT-generated semantic descriptions of UI images, fusing them with visual features through a cross-attention mechanism to enable context-aware control detection. The study systematically evaluates three fusion strategies (element-wise addition, weighted summation, and convolution) on a dataset of over 16,000 images spanning 23 control types. Experimental results show that convolution-based fusion performs best, significantly outperforming the YOLOv5 baseline, particularly in semantically complex or visually ambiguous edge cases.
📝 Abstract
Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multimodal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies (element-wise addition, weighted sum, and convolutional fusion), demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieves the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multimodal detection systems.
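To make the fusion mechanism concrete, the following is a minimal NumPy sketch of the idea described above: visual features attend over GPT-derived text embeddings via cross-attention, and the attended output is merged back into the visual stream using one of the three fusion strategies. The abstract does not specify implementation details, so all shapes, projection weights, and function names here are illustrative placeholders, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, d):
    """Queries come from visual features; keys/values from text embeddings.

    visual: (Nv, d) flattened spatial feature map; text: (Nt, d) token embeddings.
    The projection matrices are random stand-ins for learned parameters.
    """
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    Q, K, V = visual @ Wq, text @ Wk, text @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))   # (Nv, Nt) attention over text tokens
    return attn @ V                        # (Nv, d) text-conditioned features

def fuse(visual, attended, strategy="conv", alpha=0.5):
    """The three fusion strategies compared in the paper, in simplified form."""
    if strategy == "add":       # element-wise addition
        return visual + attended
    if strategy == "weighted":  # weighted sum; alpha stands in for a learned weight
        return alpha * visual + (1.0 - alpha) * attended
    if strategy == "conv":      # 1x1 conv over concatenated channels = linear mix
        d = visual.shape[1]
        W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
        return np.concatenate([visual, attended], axis=1) @ W
    raise ValueError(f"unknown strategy: {strategy}")

visual = rng.standard_normal((64, 32))  # e.g. an 8x8 feature map with 32 channels
text = rng.standard_normal((16, 32))    # e.g. 16 GPT text-embedding tokens
fused = fuse(visual, cross_attention(visual, text, 32), strategy="conv")
print(fused.shape)
```

On real feature maps the "conv" branch would be an actual 1x1 convolution over concatenated channel maps; flattened to (positions, channels) form, that is exactly the linear mix shown here, which is why it can learn a richer per-channel combination than plain addition or a scalar-weighted sum.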