Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces

📅 2025-09-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from high inference latency, low sample efficiency, and poor deployability on edge devices when used for desktop application control. To address these challenges, we propose ComputerAgent, a lightweight hierarchical reinforcement learning framework with a two-tier policy architecture (manager and subpolicy) and a triple-modal state encoder that integrates screen captures, task identifiers, and numerical state representations. The approach rests on three key innovations: a hierarchical options architecture for temporally abstracted decision-making, a meta-action early-stopping mechanism that curtails unnecessary interactions, and a compact visual backbone for efficient feature extraction. Compared to state-of-the-art MLLMs, ComputerAgent reduces model size by over four orders of magnitude and cuts inference latency by 50%, which makes on-device execution feasible. Evaluated on 135 real-world desktop tasks, it achieves a 92.1% success rate on simple tasks (fewer than 8 steps) and 58.8% on complex ones (8 or more steps), matching or exceeding 200B-parameter MLLM baselines on the simple tasks while improving sample efficiency and practicality for long-horizon interactive tasks.
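The model-size claim can be checked with quick arithmetic. The sketch below uses the two figures quoted in the abstract (15M parameters for ComputerAgent, 200B for the MLLM baseline); the variable names are illustrative, and the baseline size is taken as exactly 200B, which is an assumption since only a rounded figure is given.

```python
import math

# Figures quoted in the abstract; the baseline is only given as "200B",
# so treating it as exactly 200e9 is an assumption.
mllm_params = 200e9    # 200B-parameter MLLM baseline
agent_params = 15e6    # ComputerAgent, 15M parameters

ratio = mllm_params / agent_params
orders_of_magnitude = math.log10(ratio)

print(f"size ratio: {ratio:,.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

The ratio comes out at roughly 13,000x, a little over four orders of magnitude, which is consistent with the stated reduction.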

📝 Abstract
Controlling desktop applications via software remains a fundamental yet under-served problem. Existing multi-modal large language models (MLLMs) ingest screenshots and task instructions to generate keystrokes and mouse events, but they suffer from prohibitive inference latency, poor sample efficiency on long-horizon sparse-reward tasks, and infeasible on-device deployment. We introduce a lightweight hierarchical reinforcement learning framework, ComputerAgent, that formulates OS control as a two-level option process (manager and subpolicy), employs a triple-modal state encoder (screenshot, task ID, numeric state) to handle visual and contextual diversity, integrates meta-actions with an early-stop mechanism to reduce wasted interactions, and uses a compact vision backbone plus small policy networks for on-device inference (15M parameters). On a suite of 135 real-world desktop tasks, ComputerAgent attains 92.1% success on simple tasks (<8 steps) and 58.8% on hard tasks (>=8 steps), matching or exceeding 200B-parameter MLLM baselines on simple scenarios while reducing model size by over four orders of magnitude and halving inference time. These results demonstrate that hierarchical RL offers a practical, scalable alternative to monolithic MLLM-based automation for computer control.
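The two-level option process and the early-stop meta-action can be pictured with a toy control loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the `STOP` action name, the `Option` class, the hand-written policies, and the toy "type text, then click Save" environment are all hypothetical stand-ins for the learned manager and subpolicies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

STOP = "STOP"    # hypothetical meta-action: hand control back to the manager
TARGET = "hi"    # toy task: type this text, then click Save

State = Dict[str, object]


@dataclass
class Option:
    name: str
    policy: Callable[[State], str]  # returns a primitive action or STOP
    max_steps: int = 5              # step budget per invocation


def env_step(state: State, action: str) -> State:
    """Toy desktop: type characters into a field, then click Save."""
    new = dict(state)
    if action.startswith("key:"):
        new["text"] = str(state["text"]) + action[len("key:"):]
    elif action == "click:save" and state["text"] == TARGET:
        new["done"] = True
    return new


def run_episode(manager, options: List[Option], state: State,
                max_decisions: int = 10) -> Tuple[State, List[Tuple[str, str]]]:
    """Manager picks an option; the option acts until it stops or runs out of budget."""
    trace = []
    for _ in range(max_decisions):      # bound manager calls so the loop always ends
        if state["done"]:
            break
        option = options[manager(state)]
        for _ in range(option.max_steps):
            action = option.policy(state)
            if action == STOP:          # early stop: no environment step is spent
                break
            state = env_step(state, action)
            trace.append((option.name, action))
            if state["done"]:
                break
    return state, trace


def type_policy(state: State) -> str:
    typed = str(state["text"])
    return f"key:{TARGET[len(typed)]}" if len(typed) < len(TARGET) else STOP


def save_policy(state: State) -> str:
    return STOP if state["done"] else "click:save"


def manager(state: State) -> int:
    return 0 if state["text"] != TARGET else 1


options = [Option("type_text", type_policy), Option("click_save", save_policy)]
final_state, trace = run_episode(manager, options, {"text": "", "done": False})
print(trace)
```

In this toy run the typing option returns control after two keystrokes instead of consuming its full five-step budget, which is the kind of wasted interaction an early-stop mechanism is meant to avoid.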

Problem

Research questions and friction points this paper is trying to address.

Addressing prohibitive inference latency in desktop application control
Improving sample efficiency on long-horizon sparse-reward automation tasks
Enabling feasible on-device deployment for computer control systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical reinforcement learning with manager-subpolicy options
Triple-modal state encoder for visual and contextual inputs
Compact vision backbone and policy networks for on-device deployment
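
A triple-modal state encoder can be sketched as three small feature extractors whose outputs are concatenated. The version below is only an illustration under assumptions: average pooling stands in for the compact vision backbone, a one-hot vector stands in for whatever task-ID representation the paper uses, and the function names (`pool_screenshot`, `one_hot`, `encode_state`) are hypothetical.

```python
from typing import List, Sequence


def pool_screenshot(pixels: Sequence[Sequence[float]], block: int) -> List[float]:
    """Average-pool a grayscale image over block x block cells.

    A stand-in for a learned vision backbone, used here only to get a
    fixed-size visual feature vector.
    """
    h, w = len(pixels), len(pixels[0])
    feats = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            cell = [pixels[i][j]
                    for i in range(r, min(r + block, h))
                    for j in range(c, min(c + block, w))]
            feats.append(sum(cell) / len(cell))
    return feats


def one_hot(task_id: int, num_tasks: int) -> List[float]:
    """Encode the task ID as a one-hot vector (an assumed representation)."""
    vec = [0.0] * num_tasks
    vec[task_id] = 1.0
    return vec


def encode_state(pixels, task_id: int, numeric: Sequence[float],
                 num_tasks: int, block: int = 2) -> List[float]:
    """Concatenate screenshot, task-ID, and numeric-state features."""
    return (pool_screenshot(pixels, block)
            + one_hot(task_id, num_tasks)
            + [float(x) for x in numeric])


screenshot = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
features = encode_state(screenshot, task_id=1, numeric=[0.5, 0.25], num_tasks=3)
print(len(features), features)
```

Here a 4x4 screenshot yields 4 pooled values, the task ID adds 3, and the numeric state adds 2, giving a 9-dimensional vector that a small policy network could consume.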