AI Summary
This work addresses limitations of existing game benchmarks for vision-language model (VLM) agents, which typically report a single first-attempt score, focus on single-agent Solo play, and lack a unified protocol for evaluating heterogeneous agent classes on the same footing. The authors present OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo, Player-versus-Player (PvP), and Cooperative (Coop) modes with unified action interfaces. They also introduce the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM refines a bounded skill prompt over multiple rounds. Twelve VLM agents are evaluated on the cold-start leaderboard and four top agents under IDC, which shows how scores evolve across reflection rounds and how the learned skill behaves on held-out task variants.
Abstract
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
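The reflection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' harness: `run_idc`, `IDCResult`, `toy_play`, `toy_reflect`, and the `MAX_SKILL_CHARS` bound are all hypothetical names and values, and the toy functions stand in for a real game episode and a real tool-using reflector LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed character bound on the skill prompt; the abstract gives no concrete value.
MAX_SKILL_CHARS = 200


@dataclass
class IDCResult:
    # scores[0] is the cold-start score; scores[r] is the score after r reflections.
    scores: List[float] = field(default_factory=list)
    # Skill prompt used in the final play-through.
    skill: str = ""


def run_idc(play: Callable[[str], float],
            reflect: Callable[[str, float], str],
            rounds: int) -> IDCResult:
    """Play with the current skill, record the score, then let the reflector revise it."""
    result = IDCResult()
    skill = ""  # cold start: no learned skill yet
    for r in range(rounds + 1):
        score = play(skill)
        result.scores.append(score)
        if r < rounds:
            # Truncate the reflector's proposal so the skill prompt stays bounded.
            skill = reflect(skill, score)[:MAX_SKILL_CHARS]
    result.skill = skill
    return result


def toy_play(skill: str) -> float:
    # Hypothetical stand-in for a game episode: score equals the number of hints.
    return float(skill.count("hint"))


def toy_reflect(skill: str, score: float) -> str:
    # Hypothetical stand-in for the reflector LLM: append one hint per round.
    return skill + "hint;"


result = run_idc(toy_play, toy_reflect, rounds=3)
print(result.scores)
```

With these toy stand-ins the recorded curve is `[0.0, 1.0, 2.0, 3.0]`: the first entry is the cold-start score and the rest trace how the score evolves across reflection rounds. The second observable would come from passing `result.skill` to a play function for a held-out task variant.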