🤖 AI Summary
To address the weak generalization of task progress estimation in open-world environments, this paper proposes a test-time adaptation framework that lets a model optimize a self-supervised objective online, using only a single expert trajectory and its natural language task description, to achieve progress awareness across tasks, environments, and robot embodiments. Methodologically, the paper introduces a gradient-based meta-learning strategy that steers test-time adaptation toward semantic consistency rather than temporal ordering, combining joint vision-language modeling with self-supervised representation learning. The framework is trained in only a single environment yet generalizes robustly to out-of-distribution, highly variable scenarios. Empirically, it significantly outperforms in-context learning with autoregressive vision-language models (VLMs) on cross-task, cross-environment, and cross-embodiment benchmarks, setting a new state of the art in generalization.
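The core mechanism is easier to see in code. Below is a minimal PyTorch sketch of a MAML-style meta-training step consistent with the description above; `ProgressModel`, the monotonicity-based `self_supervised_loss`, and all hyperparameters are illustrative assumptions, not the paper's actual architecture or learned objective.

```python
# Hedged sketch: gradient-based meta-learning where the inner loop simulates
# test-time adaptation on a self-supervised objective and the outer loop
# supervises progress estimation. All names here are hypothetical stand-ins.
import torch
import torch.nn as nn
from torch.func import functional_call

class ProgressModel(nn.Module):
    """Toy stand-in: maps (frame features, language embedding) to a
    per-frame progress estimate in [0, 1]."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, frames, lang):
        lang = lang.expand(frames.shape[0], -1)
        return self.net(torch.cat([frames, lang], dim=-1)).squeeze(-1)

def self_supervised_loss(pred):
    # Simplified placeholder for the paper's *learned* self-supervised
    # objective: here, just penalize non-monotone progress along time.
    return torch.relu(pred[:-1] - pred[1:]).mean()

def inner_adapt(model, params, frames, lang, lr=1e-2, steps=1):
    # Inner loop: simulate test-time adaptation with differentiable
    # gradient steps (create_graph=True gives second-order meta-gradients).
    for _ in range(steps):
        loss = self_supervised_loss(functional_call(model, params, (frames, lang)))
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: p - lr * g for (k, p), g in zip(params.items(), grads)}
    return params

model = ProgressModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames, lang = torch.randn(10, 32), torch.randn(1, 32)  # one expert trajectory
target = torch.linspace(0, 1, 10)                       # ground-truth progress

# Outer loop: the *adapted* model must match expert progress, so
# meta-training shapes what test-time adaptation learns to rely on.
adapted = inner_adapt(model, dict(model.named_parameters()), frames, lang)
meta_loss = ((functional_call(model, adapted, (frames, lang)) - target) ** 2).mean()
opt.zero_grad(); meta_loss.backward(); opt.step()
```

The design choice this illustrates: because the outer loss is applied after the inner self-supervised steps, meta-training can bias those steps toward semantically consistent progress cues rather than whatever temporal shortcuts the raw objective would otherwise exploit.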
📝 Abstract
We propose a test-time adaptation method that enables a progress estimation model to adapt online to the visual and temporal context of test trajectories by optimizing a learned self-supervised objective. To this end, we introduce a gradient-based meta-learning strategy to train the model on expert visual trajectories and their natural language task descriptions, such that test-time adaptation improves progress estimation by relying on semantic content rather than temporal order. Our test-time adaptation method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art in-context learning approach based on autoregressive vision-language models.
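For completeness, a hedged sketch of the deployment side, reusing the toy `ProgressModel` and `self_supervised_loss` from the example above: the meta-trained weights are copied and adapted online to a single unlabeled test trajectory before reading out per-frame progress. The function name and hyperparameters are again hypothetical.

```python
# Hedged sketch of inference-time adaptation: a few self-supervised
# gradient steps on the test trajectory itself, then progress readout.
import copy
import torch

def test_time_adapt(model, frames, lang, lr=1e-2, steps=5):
    adapted = copy.deepcopy(model)  # leave meta-trained weights untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        loss = self_supervised_loss(adapted(frames, lang))  # no labels needed
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return adapted(frames, lang)  # per-frame progress estimates
```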