🤖 AI Summary
GUI automation faces challenges including low-precision detection of interactive elements (e.g., buttons, text fields), poor cross-platform adaptability, and insufficient modeling of dynamic UI states. To address these, this paper proposes an application-specific, real-time paradigm for acquiring interactive elements that jointly leverages screen perception, lightweight visual detection, runtime UI-tree parsing, and session-trajectory modeling to construct a user operation state graph, enabling real-time, cross-platform recognition on Android and desktop Chrome. We introduce the first closed-loop GUI path-planning mechanism, supporting end-to-end navigation driven by voice commands. The system is open-sourced and significantly improves confidence in interactive-element identification and accessibility-oriented automation. Our work establishes a novel methodology for GUI understanding and scriptable automation.
📝 Abstract
Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making a GUI user-accessible or scriptable, even collecting the data needed to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for an individual user can hinge on the model's precision on one specific app. We therefore take the perspective that a given user needs confidence that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e., interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and collection is especially streamlined for Android phones and desktop Chrome browsers. Explorer also enables the recording of interactive user sessions and the subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how such a map enables a kind of path planning through the GUI, letting a user issue audio commands to reach their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.
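The session map described in the abstract can support path planning in a straightforward way. Below is a minimal sketch, assuming the map is represented as a directed graph whose nodes are UI states and whose edges are labeled with user actions; the graph contents, state names, and function name here are hypothetical illustrations, not the paper's actual data structures.

```python
from collections import deque

def shortest_action_path(graph, start, goal):
    """Breadth-first search over a UI state graph.

    `graph` maps each UI state to a list of (action, next_state) pairs;
    returns the shortest sequence of actions from `start` to `goal`,
    or None if the goal is unreachable.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, nxt in graph.get(state, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, actions + [action]))
    return None

# Hypothetical state graph recorded from user sessions:
ui_graph = {
    "home":    [("tap Search", "search"), ("tap Profile", "profile")],
    "search":  [("tap Back", "home"), ("tap Result", "item")],
    "profile": [("tap Back", "home")],
    "item":    [("tap Back", "search")],
}

print(shortest_action_path(ui_graph, "home", "item"))
# -> ['tap Search', 'tap Result']
```

A voice command would then only need to resolve to a goal state; the planner replays the returned action sequence through the detected interactables.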