🤖 AI Summary
Existing approaches to automated workflow execution struggle to generalize in non-stationary environments and often model task sequences as linear segments, overlooking the underlying transition topology of workflows and limiting their adaptability to novel scenarios. This work proposes a multimodal multi-agent framework that, during an offline phase, adaptively constructs a graph-structured topological knowledge base. At inference time, it combines graph-based adaptive retrieval-augmented generation with a closed-loop collaborative verification mechanism, enabling dynamic self-correction and automatic execution of nonlinear workflows. By introducing, for the first time, a topological knowledge base coupled with a collaborative verification protocol, the method supports semantic-aware task decomposition and navigation even with limited training data, and demonstrates high reliability and strong generalization in real-world settings.
📝 Abstract
Modern information systems require autonomous agents capable of navigating complex workflows, yet current methods often struggle with the transition from structured metadata parsing to general environmental perception. Although multimodal large language models (MLLMs) now let agents interact directly with graphical user interfaces (GUIs), existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, which limits their effectiveness in novel or non-stationary scenarios. To address this, we propose a multimodal multi-agent framework that executes workflows automatically through a two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. Then, during inference, agents apply Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol, to self-correct and navigate dynamically. This graph-based approach improves task decomposition and adaptive navigation. We validate the framework in a real-world setting, demonstrating that it maintains high reliability and semantic awareness even with limited training data.
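The two-phase idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes states and actions are plain strings, that each log fragment is a list of `(state, action, next_state)` tuples, and it substitutes a breadth-first search for the adaptive RAG retrieval step. The names `build_graph`, `retrieve_path`, and `run` are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(logs):
    """Offline phase (assumed form): merge log fragments into one directed graph.

    Each fragment is a list of (state, action, next_state) tuples.
    """
    graph = defaultdict(dict)
    for fragment in logs:
        for state, action, next_state in fragment:
            graph[state][action] = next_state
    return graph


def retrieve_path(graph, start, goal):
    """Stand-in for graph retrieval: shortest action sequence via BFS."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in graph.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(action, nxt)]))
    return None  # goal is unreachable in the known graph


def run(graph, start, goal, execute, max_steps=20):
    """Closed loop: take one step, verify the observed state, re-plan from it.

    `execute(state, action)` is a hypothetical callback returning the state
    actually observed after acting. Returns (trace, corrections), or None if
    the goal is unreachable or the step budget is exhausted.
    """
    state, trace, corrections = start, [], 0
    for _ in range(max_steps):
        if state == goal:
            return trace, corrections
        path = retrieve_path(graph, state, goal)
        if not path:
            return None
        action, expected = path[0]
        observed = execute(state, action)
        trace.append((state, action, observed))
        if observed != expected:
            corrections += 1  # verification failed; next plan starts from observed
        state = observed
    return None
```

Because the graph is merged from separate fragments, a path may cross log boundaries: a route from one state to another can be recovered even if no single log recorded it end to end, which a purely linear treatment of episodes would miss.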