KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited visual generalization of pixel-based reinforcement learning agents under distribution shift, a problem compounded by existing benchmarks that entangle multiple visual factors and thereby prevent systematic analysis. To this end, the authors introduce KAGE-Env and KAGE-Bench, which factorize visual observations by modeling background, illumination, and agent appearance as independent, controllable axes. This framework provides an isolated and reproducible evaluation setting for visual generalization while leaving the underlying control task fixed. Implemented in JAX, the 2D platformer supports training at 33 million environment steps per second on a single GPU. Experiments show that background and photometric shifts sharply degrade policy performance, whereas appearance changes are comparatively benign; notably, relying on reward metrics alone can obscure generalization failures.
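The "independent, controllable axes" idea can be illustrated with a minimal sketch. This is not the actual KAGE-Env API; the class and field names below are hypothetical stand-ins showing how one visual axis can be shifted at evaluation time while all others (and the control task) stay fixed:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VisualConfig:
    # Each field is an independently controllable visual axis.
    # Names and defaults are illustrative, not the KAGE-Env interface.
    background: str = "plain"
    illumination: float = 1.0   # global brightness scale
    appearance: str = "default"  # agent sprite variant

train_cfg = VisualConfig()
# Shift exactly one axis for evaluation; dynamics and rewards are untouched.
eval_cfg = replace(train_cfg, illumination=0.5)

print(eval_cfg.illumination)                       # only this axis changed
print(eval_cfg.background == train_cfg.background)  # other axes held fixed
```

Because each axis varies in isolation, any performance drop between `train_cfg` and `eval_cfg` can be attributed to that single visual factor.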

📝 Abstract
Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation achieves up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors. Code: https://avanturist322.github.io/KAGEBench/.
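The throughput claim rests on a standard JAX pattern: a pure environment step function batched with `jax.vmap` and compiled with `jax.jit`, so thousands of environments advance in one fused call. A toy sketch of that pattern (the dynamics here are made up for illustration and bear no relation to the actual KAGE-Env step function):

```python
import jax
import jax.numpy as jnp

def step(state, action):
    """Toy pure step: velocity integrates the action, position integrates velocity."""
    pos, vel = state
    vel = 0.9 * vel + 0.1 * action
    pos = pos + vel
    reward = vel  # toy reward for forward motion
    return (pos, vel), reward

# Batch the pure step over many environments, then JIT-compile the batch.
batched_step = jax.jit(jax.vmap(step))

n_envs = 4096
states = (jnp.zeros(n_envs), jnp.zeros(n_envs))  # (positions, velocities)
actions = jnp.ones(n_envs)

states, rewards = batched_step(states, actions)
print(rewards.shape)  # one reward per vectorized environment: (4096,)
```

Because `step` is pure and array-shaped, the same pattern scales the environment count without a Python loop, which is what makes sweeps over many train-evaluation configuration pairs cheap.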
Problem

Research questions and friction points this paper is trying to address.

visual generalization
distribution shift
reinforcement learning
benchmark
pixel-based agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual generalization
factorized observation
known-axis benchmark
JAX-native environment
distribution shift