Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

📅 2025-07-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
A significant accessibility metadata gap exists across macOS applications: only 33% provide full accessibility support, severely hindering both AI agent interaction and access for visually impaired users. To address this, we propose the first method for real-time hierarchical accessibility tree (AX tree) generation from a single application screenshot, bypassing reliance on native system APIs. Our approach combines vision-language models with UI object detection to jointly perform element identification, semantic description, and hierarchical structure reconstruction. We introduce Screen2AX-Task, the first benchmark for evaluating autonomous agent task execution on macOS desktops, and publicly release three open-source datasets covering 112 macOS applications. Our method achieves a 77% F1 score on full AX tree reconstruction, delivers a 2.2× improvement over native accessibility representations on Screen2AX-Task, and outperforms OmniParser V2 on the ScreenSpot benchmark.

📝 Abstract
Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers; our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real-time, tree-structured accessibility metadata from a single screenshot. Our method uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS's system-level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX-Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.
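The hierarchy-reconstruction step described above can be illustrated with a minimal sketch: nesting detected UI elements into a tree by bounding-box containment, so each element becomes a child of the smallest element that encloses it. This is an illustrative approximation, not the paper's actual model; the `AXNode` class, the role names, and the containment heuristic are assumptions introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """One node of a hypothetical accessibility (AX) tree."""
    role: str
    bbox: tuple  # (x, y, width, height) in screen pixels
    children: list = field(default_factory=list)

def contains(outer, inner):
    """True if bbox `outer` fully encloses bbox `inner`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def build_tree(elements):
    """Nest detected elements by bounding-box containment.

    `elements` is a flat list of AXNode leaves (e.g. from a UI detector).
    Each node is attached under the smallest node that encloses it;
    nodes enclosed by nothing become children of a synthetic window root.
    """
    root = AXNode("AXWindow", (0, 0, float("inf"), float("inf")))
    # Place largest-area elements first so parents exist before children.
    for node in sorted(elements, key=lambda n: -(n.bbox[2] * n.bbox[3])):
        parent = root
        while True:
            enclosing = [c for c in parent.children if contains(c.bbox, node.bbox)]
            if not enclosing:
                break
            # Descend into the smallest enclosing child.
            parent = min(enclosing, key=lambda c: c.bbox[2] * c.bbox[3])
        parent.children.append(node)
    return root

# Example: a toolbar group containing two buttons.
detections = [
    AXNode("AXGroup",  (0, 0, 800, 60)),
    AXNode("AXButton", (10, 10, 80, 40)),
    AXNode("AXButton", (100, 10, 80, 40)),
]
tree = build_tree(detections)  # group under the window, buttons under the group
```

A geometric heuristic like this ignores semantics; Screen2AX's contribution is precisely that learned models recover groupings a pure containment rule would miss.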
Problem

Research questions and friction points this paper is trying to address.

Generates macOS accessibility metadata from screenshots
Addresses incomplete accessibility support in macOS apps
Improves AI agent interaction with desktop interfaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language and object detection models detect and describe UI elements
Generates hierarchical accessibility metadata
Public datasets for macOS applications