🤖 AI Summary
This work investigates the capability of large language models (LLMs) to map high-level semantic actions onto precise, low-level control sequences for virtual reality (VR) devices. To this end, we introduce ComboBench—the first VR interaction benchmark tailored for embodied control evaluation—comprising 262 diverse scenarios across four mainstream VR games and supporting joint assessment of controller and head-mounted display inputs. We conduct the first systematic evaluation of seven state-of-the-art LLMs (including GPT-4, Gemini-1.5-Pro, and LLaMA-3), measured against human-annotated ground truth and human performance baselines. Results reveal significant bottlenecks in procedural reasoning and spatial understanding, with model accuracy degrading as interaction complexity increases; although Gemini-1.5-Pro demonstrates superior task decomposition, all models substantially underperform humans. Few-shot prompting consistently improves accuracy. This work bridges a critical gap in evaluating LLMs' low-level embodied control competence and establishes a foundational benchmark and actionable insights for VR agent research.
📝 Abstract
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces ComboBench, a benchmark evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs—GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash—against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.