OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

📅 2026-06-08

🤖 AI Summary

This work addresses limitations of existing game benchmarks for vision-language model (VLM) agents, which typically report a single first-attempt score, focus on single-agent Solo play, and lack a unified protocol for evaluating heterogeneous agent classes on the same footing. The authors present OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo, Player-versus-Player (PvP), and Cooperative (Coop) modes with unified action interfaces. They also introduce the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM iteratively refines a bounded skill prompt over multiple rounds. The paper reports cold-start leaderboard results for twelve VLM agents and, for four top agents under IDC, how scores evolve across reflection rounds and how the learned skill behaves on held-out task variants.

📝 Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
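The IDC protocol described in the abstract can be pictured as a simple loop: play with the current skill prompt, let a reflector revise it, and record the score each round, then evaluate the final skill on held-out variants. The sketch below is only an illustration under stated assumptions. The names `run_idc`, `IDCResult`, `toy_play`, `toy_reflect`, and `toy_heldout`, the character-count bound on the skill prompt, and the toy scoring rule are all hypothetical and are not taken from the paper; the actual harness uses a tool-using reflector LLM and real UE5 games.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IDCResult:
    round_scores: List[float]  # index 0 is the cold-start score
    heldout_score: float       # final skill evaluated on a held-out variant
    final_skill: str


def run_idc(
    play: Callable[[str], float],
    reflect: Callable[[str, float], str],
    heldout_play: Callable[[str], float],
    initial_skill: str,
    rounds: int,
    max_skill_chars: int,
) -> IDCResult:
    """Hypothetical reflection loop: play, reflect, replay, for `rounds` rounds."""
    # Assumption: the "bounded" skill prompt is modeled as a character limit.
    skill = initial_skill[:max_skill_chars]
    scores = [play(skill)]  # cold-start score, before any reflection
    for _ in range(rounds):
        # The reflector sees the current skill and the latest score,
        # and returns a revised skill, truncated to the bound.
        skill = reflect(skill, scores[-1])[:max_skill_chars]
        scores.append(play(skill))
    return IDCResult(scores, heldout_play(skill), skill)


# Toy stand-ins (not real agents or games), used only to exercise the loop.
def toy_play(skill: str) -> float:
    # One point per non-empty line of the skill prompt.
    return float(sum(1 for line in skill.splitlines() if line.strip()))


def toy_reflect(skill: str, last_score: float) -> str:
    # Append one new "tip" line per round.
    return skill + f"\ntip after score {last_score:.0f}"


def toy_heldout(skill: str) -> float:
    # Pretend the held-out variant is harder: half the in-distribution score.
    return 0.5 * toy_play(skill)


result = run_idc(toy_play, toy_reflect, toy_heldout,
                 initial_skill="", rounds=3, max_skill_chars=200)
print(result.round_scores, result.heldout_score)
```

In this toy setup, the list `round_scores` plays the role of the first observable (how the score evolves across reflection rounds) and `heldout_score` the second (how the learned skill behaves on a held-out variant).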

Problem

Research questions and friction points this paper is trying to address.

VLM agents

game benchmarks

unified evaluation

Improvement Dynamics

heterogeneous agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniGameArena

Vision-Language Model (VLM)

Improvement Dynamics Curve (IDC)