🤖 AI Summary
To address weak generalization in cross-platform real-time strategy (RTS) game automation, caused by interface heterogeneity and dynamic battlefield conditions, this paper proposes a vision-language-driven closed-loop agent framework. Methodologically, it (1) introduces a "combination granularity" representation mechanism to uniformly process static images, multi-frame sequences, and full videos; and (2) designs an MV+S hybrid data strategy, combining Qwen2.5-VL's multimodal reasoning with UI-TARS's precise UI execution to enable end-to-end closed-loop decision-making for target localization, resource allocation, and area control. Key contributions include enhanced cross-platform generalization and an optimized multimodal fusion architecture. Experiments demonstrate a 63% reduction in inference latency, a BLEU-4 score of 62.41% (+57.6 percentage points over the baseline), and significant improvements in both real-time responsiveness and task-completion rate, surpassing state-of-the-art approaches.
📝 Abstract
Automated operation in cross-platform strategy games demands agents with robust generalization across diverse user interfaces and dynamic battlefield conditions. While vision-language models (VLMs) have shown considerable promise in multimodal reasoning, their application to complex human-computer interaction scenarios such as strategy gaming remains largely unexplored. Here, we introduce Yanyun-3, a general-purpose agent framework that, for the first time, enables autonomous cross-platform operation across three heterogeneous strategy game environments. By integrating the vision-language reasoning of Qwen2.5-VL with the precise execution capabilities of UI-TARS, Yanyun-3 successfully performs core tasks including target localization, combat resource allocation, and area control. Through systematic ablation studies, we evaluate the effects of various multimodal data combinations (static images, multi-image sequences, and videos) and propose the concept of combination granularity to distinguish intra-sample fusion from inter-sample mixing. We find that a hybrid strategy that fuses multi-image and video data while mixing in static images (MV+S) substantially outperforms full fusion: it reduces inference time by 63% and raises the BLEU-4 score roughly 13-fold (from 4.81% to 62.41%). Operating via a closed-loop pipeline of screen capture, model inference, and action execution, the agent demonstrates strong real-time performance and cross-platform generalization. Beyond providing an efficient solution for strategy game automation, our work establishes a general paradigm for enhancing VLM performance through structured multimodal data organization, offering new insights into the interplay between static perception and dynamic reasoning in embodied intelligence.
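To make the combination granularity idea concrete, the sketch below organizes toy training samples under an MV+S-style strategy: multi-image sequences and video clips are fused *within* a single sample (intra-sample fusion), while each static image remains its *own* sample, interleaved into the same training stream (inter-sample mixing). This is a minimal illustration, not the paper's implementation; all names (`Sample`, `fuse_mv`, `mix_in_static`) and the string identifiers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One training sample: a list of visual elements plus a text prompt."""
    visuals: list
    prompt: str

def fuse_mv(multi_images, video_clips, prompt):
    """Intra-sample fusion: multi-image and video inputs share ONE sample."""
    return Sample(visuals=list(multi_images) + list(video_clips), prompt=prompt)

def mix_in_static(fused_samples, static_images, prompt):
    """Inter-sample mixing: each static image stays its OWN sample,
    interleaved with the fused MV samples in one training stream."""
    statics = [Sample(visuals=[img], prompt=prompt) for img in static_images]
    stream = []
    # Simple round-robin interleave of the two sample pools.
    for i in range(max(len(fused_samples), len(statics))):
        if i < len(fused_samples):
            stream.append(fused_samples[i])
        if i < len(statics):
            stream.append(statics[i])
    return stream

# Build an MV+S training stream from toy identifiers.
mv = [fuse_mv(["img:a1", "img:a2"], ["vid:v1"], "locate target")]
data = mix_in_static(mv, ["img:s1", "img:s2"], "describe scene")
print(len(data))        # 3 samples total
print(data[0].visuals)  # fused sample: ['img:a1', 'img:a2', 'vid:v1']
```

The distinction the code makes explicit is the one the abstract draws: fusion changes what a single sample contains, while mixing changes the composition of the dataset across samples.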