🤖 AI Summary
This work investigates the capability of large language models (LLMs) to map high-level semantic actions onto precise, low-level control sequences for virtual reality (VR) devices. To this end, we introduce ComboBench—the first VR interaction benchmark tailored for embodied control evaluation—comprising 262 diverse scenarios across four mainstream VR games and supporting joint assessment of controller and head-mounted display inputs. We conduct the first systematic evaluation of seven state-of-the-art LLMs (including GPT-4, Gemini-1.5-Pro, and LLaMA-3), measured against human-annotated ground truth and human performance baselines. Results reveal significant bottlenecks in procedural reasoning and spatial understanding, with model accuracy degrading as interaction complexity increases; although Gemini-1.5-Pro demonstrates superior task decomposition, all models substantially underperform humans. Few-shot prompting consistently improves accuracy. This work bridges a critical gap in evaluating LLMs' low-level embodied control competence and establishes a foundational benchmark and actionable insights for VR agent research.
📝 Abstract
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces ComboBench, a benchmark evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs—GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash—against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.