OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents

📅 2025-02-26
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Multimodal large language model (MLLM)-driven GUI agents commonly suffer from “over-execution”—uncontrolled autonomous actions under ambiguous instructions, environmental shifts, or unexpected interruptions—due to the absence of confidence estimation and human–agent collaboration mechanisms. Method: We propose the first adaptive multimodal GUI agent framework, integrating an MLLM, a lightweight confidence prediction module, and a collaborative data construction strategy. Its core innovations are: (1) a cooperative probe annotation mechanism enabling interaction-level confidence modeling, and (2) a confidence-driven dynamic intervention paradigm supporting real-time switching between autonomous execution and human-in-the-loop correction. Contribution/Results: Evaluated on our custom complex-scenario dataset and established benchmarks (AITZ, Meta-GUI), the framework achieves 24.59%–87.29% absolute improvement in task success rate, significantly enhancing robustness, generalization, and practical scalability.

📝 Abstract
Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent executes tasks in a fully autonomous way, without adequate assessment of its action confidence, which compromises adaptive human-agent collaboration. This poses substantial risks in complex scenarios, such as those involving ambiguous user instructions, unexpected interruptions, and environmental hijacks. To address the issue, we introduce OS-Kairos, an adaptive GUI agent capable of predicting confidence levels at each interaction step and efficiently deciding whether to act autonomously or seek human intervention. OS-Kairos is developed through two key mechanisms: (i) collaborative probing that annotates confidence scores at each interaction step; (ii) confidence-driven interaction that leverages these confidence scores to elicit adaptive interaction. Experimental results show that OS-Kairos substantially outperforms existing models on our curated dataset featuring complex scenarios, as well as on established benchmarks such as AITZ and Meta-GUI, with 24.59%–87.29% improvements in task success rate. OS-Kairos facilitates adaptive human-agent collaboration, prioritizing effectiveness, generality, scalability, and efficiency for real-world GUI interaction. The dataset and codes are available at https://github.com/Wuzheng02/OS-Kairos.
Problem

Research questions and friction points this paper is trying to address.

Addresses over-execution in autonomous GUI agents
Improves adaptive human-agent collaboration confidence
Handles ambiguous instructions and unexpected interruptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive GUI agent with confidence prediction
Collaborative probing for confidence annotation
Confidence-driven interaction for human intervention
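The confidence-driven interaction described above can be sketched as a simple decision loop: at each step the agent compares a predicted confidence score against a threshold and either acts autonomously or defers to a human. This is a minimal illustrative sketch; the class names, threshold value, and callback interface are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One proposed GUI action with a confidence score in [0, 1]
    (hypothetical stand-in for the paper's confidence prediction module)."""
    action: str
    confidence: float

def run_episode(steps, ask_human, threshold=0.8):
    """Execute high-confidence steps autonomously; route low-confidence
    steps through a human-in-the-loop correction callback."""
    executed = []
    for step in steps:
        if step.confidence >= threshold:
            executed.append(step.action)              # autonomous execution
        else:
            executed.append(ask_human(step.action))   # human intervention
    return executed

# Usage: a human reviewer rewrites any step the agent is unsure about.
steps = [Step("tap('Settings')", 0.95),
         Step("tap('Wi-Fi')", 0.40),
         Step("toggle('Airplane mode')", 0.90)]
actions = run_episode(steps, ask_human=lambda a: "human-corrected:" + a)
print(actions)
```

The threshold controls the trade-off the paper targets: raising it asks for more human help (fewer over-executions), lowering it increases autonomy.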