TextAtari: 100K Frames Game Playing with Language Agents

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates fundamental limitations of language agents in ultra-long-horizon decision-making tasks (up to 100,000 steps). To address the lack of standardized benchmarks, we introduce TextAtari—the first large-scale text-based reinforcement learning benchmark—converting Atari game visual states into structured natural-language descriptions across nearly 100 diverse tasks. Text rendering relies on AtariARI, an unsupervised representation learning framework that maps visual states to semantically grounded textual abstractions. We systematically evaluate state-of-the-art open-weight language models—including Qwen2.5-7B, Gemma-7B, and Llama3.1-8B—under zero-shot, few-shot chain-of-thought, and reflective reasoning paradigms for long-horizon planning. Results reveal substantial performance gaps versus human players in state tracking, cross-step reasoning, and strategic planning, exposing core deficiencies in semantic grounding and persistent instruction following. To foster reproducible research, we publicly release the benchmark, evaluation protocols, and baseline implementations—establishing a standardized infrastructure for long-duration language agent research.

📝 Abstract
We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning.
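The state-to-text rendering step the abstract describes might look like the following minimal sketch. The field names and sentence template are hypothetical, invented here for illustration; the benchmark's actual AtariARI label schema and phrasing may differ.

```python
# Hypothetical sketch: render AtariARI-style structured labels
# (entity positions read from game RAM) as a natural-language
# observation for a Pong-like game. Field names are assumptions.

def describe_state(labels: dict) -> str:
    """Render structured state labels as a text observation."""
    parts = [
        f"Your paddle is at y={labels['player_y']}.",
        f"The opponent's paddle is at y={labels['enemy_y']}.",
        f"The ball is at x={labels['ball_x']}, y={labels['ball_y']}.",
        f"Score: you {labels['player_score']}, opponent {labels['enemy_score']}.",
    ]
    return " ".join(parts)

state = {
    "player_y": 110, "enemy_y": 80,
    "ball_x": 70, "ball_y": 95,
    "player_score": 2, "enemy_score": 3,
}
print(describe_state(state))
# → Your paddle is at y=110. The opponent's paddle is at y=80.
#   The ball is at x=70, y=95. Score: you 2, opponent 3.
```

A per-game template like this is one plausible design: it keeps observations compact and schema-stable across tens of thousands of steps, which matters when the agent's context window is limited.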
Problem

Research questions and friction points this paper is trying to address.

Evaluating language agents on long-horizon decision-making tasks
Bridging sequential decision-making with natural language processing
Assessing performance gaps between language agents and humans
Innovation

Methods, ideas, or system contributions that make the work stand out.

TextAtari converts Atari visuals to text descriptions
Evaluates LLMs on nearly 100 diverse long-horizon tasks
Tests reasoning methods like reflection and CoT
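The zero-shot evaluation setup implied above can be sketched as a simple observe-prompt-act loop. Here `query_llm` is a stub standing in for a real model such as Qwen2.5-7B, and the observation stream is a toy list, not TextAtari itself; prompt wording and action names are assumptions.

```python
# Minimal sketch of a zero-shot language-agent loop: at each step the
# agent receives a text observation, prompts an LLM for an action, and
# records it. A real agent would call a model API in query_llm.

ACTIONS = ["NOOP", "UP", "DOWN"]

def query_llm(prompt: str) -> str:
    # Stub policy standing in for an LLM call: moves toward the ball
    # mentioned in the prompt. Replace with a real model invocation.
    return "UP" if "ball above" in prompt else "DOWN"

def run_episode(observations, max_steps=100_000):
    trajectory = []
    for step, obs in enumerate(observations):
        if step >= max_steps:
            break
        prompt = (
            "You are playing Pong. Valid actions: "
            f"{', '.join(ACTIONS)}.\nObservation: {obs}\nAction:"
        )
        trajectory.append(query_llm(prompt))
    return trajectory

obs_stream = ["ball above your paddle", "ball below your paddle"]
print(run_episode(obs_stream))  # → ['UP', 'DOWN']
```

Few-shot chain-of-thought and reflection variants would differ only in how the prompt is built, e.g. prepending worked examples or a summary of earlier mistakes before the current observation.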
Wenhao Li
Tongji University, Shanghai, China
Wenwu Li
Tongji University, Shanghai, China
Chuyun Shen
East China Normal University, Shanghai, China
Junjie Sheng
East China Normal University
Learning From Feedback · Multi-Agent · Scheduling & Planning
Zixiao Huang
East China Normal University, Shanghai, China
Di Wu
Tongji University, Shanghai, China
Yun Hua
Shanghai Jiao Tong University, Shanghai, China
Wei Yin
Staff Research Scientist, Horizon Robotics
World Model · Generative AI · Physical AI
Xiangfeng Wang
East China Normal University, Shanghai, China
Hongyuan Zha
The Chinese University of Hong Kong, Shenzhen
machine learning
Bo Jin
Tongji University, Shanghai, China