🤖 AI Summary
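Problem: In zero-shot imitation learning, approaches that treat the expert demonstration as a sequence of goals for a goal-conditioned policy can behave myopically: actions that reach the immediate goal may undermine the long-horizon objective.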
Method: We propose a non-myopic offline imitation framework that directly optimizes the occupancy matching objective intrinsic to imitation learning. A goal-conditioned value function is lifted to an optimal transport distance between occupancy measures, and the occupancies are approximated with a learned world model. The approach imitates unseen behavior zero-shot from as little as a single expert demonstration, without task-specific retraining, and it can learn from offline, suboptimal data.
Results: On complex continuous-control benchmarks, the method outperforms existing goal-sequence imitation approaches. It achieves non-myopic, zero-shot imitation while being trained only on offline, suboptimal data.
📝 Abstract
Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.
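The idea of comparing occupancies with a value-based distance can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names (`sinkhorn_ot`, `toy_goal_cost`, `occupancy_ot`) are hypothetical, Euclidean distance stands in for a learned goal-conditioned value function, occupancies are treated as uniform weights over visited states, and an entropy-regularized Sinkhorn solver is used as one common way to approximate optimal transport. In the method described above, the rollout states would come from a learned world model rather than being written by hand.

```python
import numpy as np


def sinkhorn_ot(cost, a, b, reg=0.05, n_iters=500):
    """Approximate the optimal transport cost between histograms a and b.

    Uses entropy-regularized Sinkhorn iterations; `reg` is the
    regularization strength (an assumed value, not taken from the paper).
    """
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # match column marginals
        u = a / (K @ v)                  # match row marginals
    plan = u[:, None] * K * v[None, :]   # transport plan
    return float(np.sum(plan * cost)), plan


def toy_goal_cost(states, goals):
    """Hypothetical stand-in for a goal-conditioned value function.

    Euclidean distance is used as a proxy for how costly it is to
    reach each goal from each state.
    """
    return np.linalg.norm(states[:, None, :] - goals[None, :, :], axis=-1)


def occupancy_ot(rollout_states, expert_goals, reg=0.05):
    """Distance between the empirical occupancies of a rollout and a demo."""
    n, m = len(rollout_states), len(expert_goals)
    a = np.full(n, 1.0 / n)              # uniform weights over rollout states
    b = np.full(m, 1.0 / m)              # uniform weights over expert goals
    cost = toy_goal_cost(rollout_states, expert_goals)
    return sinkhorn_ot(cost, a, b, reg)


# Expert demonstration: five goals along a line from (0, 0) to (1, 0).
expert = np.stack([np.linspace(0.0, 1.0, 5), np.zeros(5)], axis=1)

# A rollout that tracks the whole demonstration with a small offset.
follower = expert + np.array([0.0, 0.02])

# A rollout that reaches the first goal and then stays there.
staller = np.zeros((5, 2))

cost_follow, plan_follow = occupancy_ot(follower, expert)
cost_stall, plan_stall = occupancy_ot(staller, expert)
print(f"follower: {cost_follow:.3f}  staller: {cost_stall:.3f}")
```

In this toy setting the rollout that stalls at the first goal is penalized for all the expert states it never visits, whereas a per-goal objective would rate its first step as perfect. That is the sense in which matching occupancies discourages myopic behavior.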