Agent Skills Should Go Beyond Text: The Case for Visual Skills

📅 2026-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to agent skill learning store reusable experience mainly as text, which struggles with vision-centric tasks that require spatial layout understanding, visual grounding, and fine-grained appearance judgment. This work formalizes “visual skills” and proposes a multimodal skill paradigm that pairs declarative textual logic with explicit visual support. It distinguishes three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the frames, screenshots, or page regions that justify them. An automatic system (SYSTEM) extracts multimodal skills directly from task execution trajectories. Beyond specifying *what* to do, the framework encodes *where* to look, *how* to inspect, and *how* to verify visual outcomes. In experiments on GUI and other visual-centric tasks, visual skills consistently outperform text-only skills, particularly when success depends on spatial correspondence, visual evidence, and state-aware interaction.
📝 Abstract
Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \NAME, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \SYSTEM, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
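To make the idea of an interleaved visual skill concrete, here is a minimal Python sketch of one possible data structure, in which each ordered text step carries pointers to the frames and regions that justify it. The abstract does not give the paper's schema, so every name here (`VisualRef`, `SkillStep`, `VisualSkill`, `skill_from_trajectory`) and the `(instruction, frame_id, region)` trajectory format are assumptions made for illustration. Dynamic priors are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical schema: none of these names come from the paper.
BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class VisualRef:
    """Pointer to the visual evidence that justifies a step."""
    frame_id: str                  # screenshot or frame identifier
    region: Optional[BBox] = None  # where to look; None means the whole frame


@dataclass
class SkillStep:
    instruction: str                                         # what to do
    evidence: List[VisualRef] = field(default_factory=list)  # where to look
    check: Optional[str] = None                              # how to verify the outcome


@dataclass
class VisualSkill:
    name: str
    static_priors: List[str] = field(default_factory=list)  # stable spatial conventions
    steps: List[SkillStep] = field(default_factory=list)

    def ungrounded_steps(self) -> List[int]:
        """Return indices of steps that have no visual evidence attached."""
        return [i for i, step in enumerate(self.steps) if not step.evidence]


def skill_from_trajectory(name, trajectory):
    """Build a skill from (instruction, frame_id, region) records.

    The record format is an assumption; a real system would parse
    richer agent logs.
    """
    steps = []
    for instruction, frame_id, region in trajectory:
        evidence = [VisualRef(frame_id, region)] if frame_id else []
        steps.append(SkillStep(instruction, evidence))
    return VisualSkill(name=name, steps=steps)


trajectory = [
    ("Open the settings menu", "frame_000", (1180, 12, 1220, 52)),
    ("Enable dark mode", "frame_001", (400, 300, 460, 330)),
    ("Confirm the change", None, None),
]
skill = skill_from_trajectory("enable_dark_mode", trajectory)
print(skill.ungrounded_steps())  # [2]
```

Keeping the evidence on each step, rather than in a separate image list, preserves the binding between text and pixels, so a step that degrades to text-only is easy to detect.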
Problem

Research questions and friction points this paper is trying to address.

visual skills
multimodal agents
reusable skills
visual grounding
spatial layout
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal skills
visual grounding
reusable agent experience
spatial reasoning
visual-centric tasks