Action Tokenizer Matters in In-Context Imitation Learning

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In in-context imitation learning (ICIL), existing action tokenizers encode action trajectories effectively but neglect temporal smoothness, which leads to unstable robot execution. This work systematically evaluates existing action tokenizers in ICIL, identifies this limitation, and proposes LipVQ-VAE, a vector-quantized variational autoencoder that enforces a Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from the raw action inputs to the quantized codebook, the model yields smoother, more stable discrete action representations. In high-fidelity simulation, LipVQ-VAE improves task success rates by more than 5.3% over prior tokenizers, and real-robot experiments confirm smoother generated trajectories and more robust execution.

📝 Abstract
In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is the key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenizer methods in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates more stable and smoother actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, with real-world experiments confirming its ability to produce smoother, more reliable trajectories. Code and checkpoints will be released.
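The abstract's core mechanism, enforcing a Lipschitz bound through weight normalization, can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the layer size, the bound `c = 1`, and the use of `tanh` (itself 1-Lipschitz) are illustrative assumptions. The idea is that rescaling a weight matrix by its spectral norm caps how much the layer can amplify the distance between two inputs, which is what keeps the encoded actions smooth over time.

```python
import numpy as np

def lipschitz_normalize(W, c=1.0):
    """Rescale W so its spectral norm (largest singular value) is at most c.

    A linear map with spectral norm <= c is c-Lipschitz; composing it with a
    1-Lipschitz activation such as tanh keeps the whole layer c-Lipschitz.
    """
    sn = np.linalg.norm(W, 2)  # ord=2 on a matrix gives the spectral norm
    return W / max(1.0, sn / c)

rng = np.random.default_rng(0)
W = lipschitz_normalize(rng.normal(size=(8, 4)), c=1.0)

def layer(x):
    return np.tanh(W @ x)  # 1-Lipschitz linear map + 1-Lipschitz activation

# Two nearby action inputs cannot be mapped further apart than they started.
x1, x2 = rng.normal(size=4), rng.normal(size=4)
ratio = np.linalg.norm(layer(x1) - layer(x2)) / np.linalg.norm(x1 - x2)
```

Here `ratio` never exceeds 1, so small changes between consecutive raw actions translate into correspondingly small changes in the latent encoding.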
Problem

Research questions and friction points this paper is trying to address.

Evaluates the impact of action tokenizers in in-context imitation learning.
Identifies the lack of temporal smoothness in existing action encodings.
Proposes LipVQ-VAE for smoother, more stable robotic actions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces LipVQ-VAE for action tokenization
Enforces Lipschitz condition for smooth actions
Improves ICIL performance by over 5.3%
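The "VQ" half of LipVQ-VAE is the standard vector-quantization step: each continuous latent action vector is snapped to its nearest entry in a learned codebook, producing a discrete action token. The sketch below shows only that nearest-neighbor lookup in numpy; the codebook size (16) and latent dimension (4) are arbitrary assumptions, and a real VQ-VAE would also train the codebook with commitment and reconstruction losses.

```python
import numpy as np

def quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry (VQ step).

    Returns the quantized vectors and their integer token indices.
    """
    # Pairwise distances between latents (n, d) and codebook entries (k, d).
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)       # discrete action token per latent
    return codebook[idx], idx

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 4))   # 16 discrete action tokens
z = rng.normal(size=(5, 4))           # 5 latent action vectors
zq, tokens = quantize(z, codebook)
```

Because the encoder feeding `z` is Lipschitz-constrained, nearby raw actions land near each other in latent space and tend to quantize to the same or adjacent tokens, which is how the smoothness constraint survives discretization.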
An Dinh Vuong
Mohamed bin Zayed University of Artificial Intelligence
computer vision, robotic learning, reinforcement learning
Minh Nhat Vu
Automation & Control Institute (ACIN), Vienna, Austria
Robotics
Dong An
Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence, UAE
Ian Reid
Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence, UAE