Zero-Shot Offline Imitation Learning via Optimal Transport

📅 2024-10-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: In zero-shot imitation learning, pipelines that pair a goal selector with a goal-conditioned policy often behave myopically: actions taken to reach the next goal can undermine the long-horizon objective. Method: The paper proposes a non-myopic offline imitation framework that lifts a goal-conditioned value function to a distance between occupancy measures and directly optimizes the resulting occupancy matching objective. A learned world model approximates the occupancies, so the method can imitate a single expert demonstration zero-shot, without online interaction or task-specific retraining, and it can learn from suboptimal offline data. Results: On complex continuous-control benchmarks, the method achieves non-myopic, zero-shot imitation and is reported to outperform existing approaches that treat the demonstration as a sequence of goals.

📝 Abstract
Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.
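
To make the occupancy matching idea concrete, the sketch below computes an optimal transport cost between the states visited by a rollout and the states in an expert demonstration. It is an illustration only, not the paper's implementation: `ground_cost` is a hypothetical stand-in (plain Euclidean distance) for the cost that the paper derives from a learned goal-conditioned value function, both state sets are treated as uniform empirical distributions of equal size, and the brute-force search over matchings is feasible only for tiny inputs.

```python
from itertools import permutations
import math


def ground_cost(state, goal):
    """Hypothetical stand-in for a goal-conditioned cost d(state, goal).

    Assumption: Euclidean distance replaces the learned goal-conditioned
    value function that the paper uses in this role.
    """
    return math.dist(state, goal)


def ot_distance(policy_states, expert_states):
    """Exact OT cost between two uniform empirical distributions of equal size.

    For equal-size uniform samples an optimal plan is a one-to-one matching,
    so minimizing over all permutations gives the exact transport cost.
    """
    n = len(policy_states)
    assert n == len(expert_states), "this sketch assumes equal sample sizes"
    best_total = min(
        sum(ground_cost(policy_states[i], expert_states[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n


expert = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
rollout_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # visits the expert's states
rollout_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]  # shifted by one unit

print(ot_distance(rollout_a, expert))  # 0.0
print(ot_distance(rollout_b, expert))  # 1.0
```

Because the cost compares whole sets of visited states rather than the distance to the next goal alone, a planner that minimizes it scores a candidate behavior by its entire trajectory, which is the non-myopic property the abstract describes.
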
Problem

Research questions and friction points this paper is trying to address.

Mitigates myopic behavior in imitation
Optimizes occupancy matching directly
Learns from offline, suboptimal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot imitation learning
Occupancy matching optimization
Learned world model integration
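
As a rough illustration of how a world model can stand in for an occupancy, the sketch below rolls out a toy model and accumulates the standard discounted visitation weights, (1 - gamma) * gamma^t for the state at step t. Everything here is an assumption made for illustration: `world_model_step` is a hypothetical deterministic model on an integer line, the rollout is truncated, and the leftover tail mass is assigned to the final state as if it were absorbing. The paper's world model and occupancy estimator may differ.

```python
from collections import defaultdict


def world_model_step(state, action):
    """Hypothetical deterministic world model: states are integers on a line."""
    return state + action


def discounted_occupancy(start, actions, gamma=0.5):
    """Empirical discounted state occupancy of one imagined rollout.

    The state at step t receives weight (1 - gamma) * gamma**t. The remaining
    tail mass gamma**T goes to the final state (a simplifying assumption),
    so the weights sum to 1.
    """
    occupancy = defaultdict(float)
    state = start
    for t, action in enumerate(actions):
        occupancy[state] += (1 - gamma) * gamma ** t
        state = world_model_step(state, action)
    occupancy[state] += gamma ** len(actions)
    return dict(occupancy)


occ = discounted_occupancy(start=0, actions=[1, 1, -1])
print(occ)  # {0: 0.5, 1: 0.375, 2: 0.125}
```

The resulting weighted set of states is the kind of object that an occupancy matching objective compares against the expert demonstration.
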