🤖 AI Summary
This work addresses the limited sample efficiency of current robot imitation learning methods, which rely solely on low-level sensorimotor data while neglecting the high-level semantic knowledge humans naturally possess about tasks. Building upon Action Chunking with Transformers (ACT), the study introduces ConceptACT, which incorporates human-provided episode-level semantic concepts (such as object properties, spatial relationships, and task constraints) through a concept-aware cross-attention mechanism in the final encoder layer. The attention of that layer is supervised to align with the human annotations, so semantic labels are needed only during demonstration collection and no language input is required at deployment. Evaluated on two robot manipulation tasks with logical constraints, ConceptACT converges faster and achieves higher sample efficiency than standard ACT, and its architectural integration significantly outperforms both naive auxiliary prediction losses and language-conditioned baselines.
📝 Abstract
Imitation learning enables robots to acquire complex manipulation skills from human demonstrations, but current methods rely solely on low-level sensorimotor data while ignoring the rich semantic knowledge humans naturally possess about tasks. We present ConceptACT, an extension of Action Chunking with Transformers that leverages episode-level semantic concept annotations during training to improve learning efficiency. Unlike language-conditioned approaches that require semantic input at deployment, ConceptACT uses human-provided concepts (object properties, spatial relationships, task constraints) exclusively during demonstration collection, adding minimal annotation burden. We integrate concepts using a modified transformer architecture in which the final encoder layer implements concept-aware cross-attention, supervised to align with human annotations. Through experiments on two robotic manipulation tasks with logical constraints, we demonstrate that ConceptACT converges faster and achieves superior sample efficiency compared to standard ACT. Crucially, we show that architectural integration through attention mechanisms significantly outperforms naive auxiliary prediction losses or language-conditioned models. These results demonstrate that properly integrated semantic supervision provides powerful inductive biases for more efficient robot learning.
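To make the described mechanism concrete, here is a minimal PyTorch sketch of a final encoder layer whose cross-attention attends over learned concept embeddings and whose pooled attention weights are supervised against human concept annotations. All names (`ConceptAwareCrossAttention`, `concept_alignment_loss`) and the choice of KL-divergence supervision over pooled attention weights are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptAwareCrossAttention(nn.Module):
    """Hypothetical sketch of a concept-aware final encoder layer.

    Encoder tokens act as queries; keys/values are a bank of learned
    concept embeddings. The attention weights, averaged over tokens,
    yield a per-concept relevance distribution that can be supervised
    with human annotations during training only.
    """

    def __init__(self, d_model: int, num_concepts: int, num_heads: int = 8):
        super().__init__()
        # One learned embedding per annotated semantic concept.
        self.concept_embed = nn.Embedding(num_concepts, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq, d_model) from the previous encoder layer.
        b = tokens.size(0)
        concepts = self.concept_embed.weight.unsqueeze(0).expand(b, -1, -1)
        out, attn_w = self.attn(tokens, concepts, concepts,
                                need_weights=True, average_attn_weights=True)
        # attn_w: (batch, seq, num_concepts); pool over tokens to get one
        # relevance distribution over concepts per demonstration segment.
        concept_scores = attn_w.mean(dim=1)  # rows sum to 1 (softmax over keys)
        return out, concept_scores

def concept_alignment_loss(concept_scores: torch.Tensor,
                           annotations: torch.Tensor) -> torch.Tensor:
    """Align pooled attention with binary concept annotations.

    annotations: (batch, num_concepts) multi-hot labels collected with the
    demonstrations; normalized to a target distribution for KL supervision.
    """
    target = annotations / annotations.sum(dim=1, keepdim=True).clamp_min(1e-6)
    return F.kl_div(concept_scores.clamp_min(1e-6).log(), target,
                    reduction="batchmean")
```

This auxiliary loss would be added to ACT's usual action-reconstruction objective at training time; at deployment the layer runs unchanged with no annotation input, consistent with the paper's claim that semantics are needed only during demonstration collection.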