🤖 AI Summary
Existing general-purpose agents either rely too heavily on a single monolithic model or exhibit excessive module coupling in multimodal interaction. To address this, we propose a dual-path modular multimodal agent designed for automated computer interaction. Our approach introduces the first synergistic architecture integrating tool-augmented agents with pure vision agents, supporting heterogeneous inputs (text, images, audio, and video) while enabling decoupled task solving via cross-modal perceptual alignment, modular task orchestration, and stepwise reasoning. We conduct the first unified evaluation across benchmarks of differing character (OSWorld, GAIA, and SWE-Bench), achieving 7.27% accuracy on OSWorld and surpassing Claude-Computer-Use. All code and evaluation scripts are publicly released.
📝 Abstract
This paper introduces **InfantAgent-Next**, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or provide only workflow-level modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by the ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve **7.27%** accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.