🤖 AI Summary
This work addresses the weak generalization and heavy reliance on domain-specific engineering exhibited by large language models (LLMs) and vision-language models (VLMs) in multi-turn interactive game environments. We propose the first general-purpose modular framework that decouples perception, memory, and reasoning into independent, interchangeable components, enabling plug-and-play integration of arbitrary LLM or VLM backbones without task-specific customization. Evaluated uniformly across classic (e.g., Zork) and modern (e.g., ALFWorld, VoxSim) game benchmarks, our framework reveals systematic component contributions: memory dominates performance gains in long-horizon puzzles, while perception is critical under high visual interference. Experiments show that the framework consistently outperforms end-to-end baselines across diverse tasks, significantly improving robustness and adaptability in dynamic, interactive settings. The framework establishes an interpretable, scalable architectural paradigm for general embodied intelligence.
📝 Abstract
We introduce a modular harness design for LLM agents that is composed of perception, memory, and reasoning components, enabling a single LLM or VLM backbone to tackle a wide spectrum of multi-turn gaming environments without domain-specific engineering. Using classic and modern game suites as low-barrier, high-diversity testbeds, our framework provides a unified workflow for analyzing how each module affects performance across dynamic interactive settings. Extensive experiments demonstrate that the harness consistently lifts gameplay performance over un-harnessed baselines and reveals distinct contribution patterns: for example, memory dominates in long-horizon puzzles, while perception is critical in vision-noisy arcades. These findings highlight the effectiveness of our modular harness design in advancing general-purpose agents, given the familiarity and ubiquity of games in everyday human experience.
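To make the decoupling concrete, the sketch below shows one way a perception/memory/reasoning harness around a swappable backbone could be structured. All class and method names here are illustrative assumptions, not the paper's actual API; the backbone is stubbed with a plain callable so the example is self-contained.

```python
from dataclasses import dataclass, field


class Perception:
    """Turns a raw observation into a textual percept for the backbone."""

    def observe(self, raw_obs):
        return f"OBS: {raw_obs}"


@dataclass
class Memory:
    """Keeps a rolling window of recent turns (a stand-in for richer memory)."""

    window: int = 4
    turns: list = field(default_factory=list)

    def add(self, entry):
        self.turns.append(entry)

    def recall(self):
        return self.turns[-self.window:]


class Reasoning:
    """Builds a prompt from the percept plus recalled history, queries the backbone."""

    def __init__(self, backbone):
        self.backbone = backbone  # any callable LLM/VLM wrapper

    def act(self, percept, history):
        prompt = "\n".join(history + [percept, "ACTION:"])
        return self.backbone(prompt)


class HarnessAgent:
    """Composes the three modules; each can be replaced independently."""

    def __init__(self, perception, memory, reasoning):
        self.perception = perception
        self.memory = memory
        self.reasoning = reasoning

    def step(self, raw_obs):
        percept = self.perception.observe(raw_obs)
        action = self.reasoning.act(percept, self.memory.recall())
        self.memory.add(percept)
        self.memory.add(f"ACT: {action}")
        return action


# Stub backbone so the sketch runs without a real model.
agent = HarnessAgent(Perception(), Memory(window=4), Reasoning(lambda p: "go north"))
print(agent.step("You are in a dark room."))
```

Because each module only touches the others through small interfaces (`observe`, `recall`, `act`), swapping in a different memory scheme or a VLM-backed perception module requires no change to the rest of the loop, which is the plug-and-play property the abstract describes.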