FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing imitation learning approaches rely solely on optimal demonstrations and fail to exploit the useful information in suboptimal samples, resulting in low data efficiency and error propagation. To address this, we propose a language-augmented imitation learning framework that incorporates natural language feedback directly into the input sequence of a Transformer-based policy network. We further introduce feedback prediction as an auxiliary self-supervised task, enabling the model to discriminate between behaviour modes and extract effective behavioural patterns from suboptimal demonstrations. By replacing sparse scalar rewards with fine-grained linguistic signals, our method enhances policy robustness and compositional generalisation. Evaluated on multiple embodied vision-and-language tasks in the BabyAI-XGen environment, our approach achieves significant improvements in compositional generalisation and robustness to noisy demonstrations. These results support language feedback as an interpretable, information-dense supervisory signal that is competitive with conventional reward-based and demonstration-only paradigms.
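Read as described, training pairs the standard behavioural-cloning (next-action prediction) loss with the auxiliary feedback-prediction loss. A plausible form of the combined objective, where the weighting coefficient $\lambda$ and the exact loss terms are assumptions not spelled out in the summary, is:

$$
\mathcal{L} \;=\; \underbrace{-\log \pi_\theta\!\left(a_t \mid s_{\le t}, f_{\le t}\right)}_{\text{next-action prediction}} \;+\; \lambda \, \underbrace{\left(-\log p_\theta\!\left(f_t \mid s_{\le t}, a_{\le t}\right)\right)}_{\text{auxiliary feedback prediction}}
$$

where $s_{\le t}$ are observations, $a_t$ the demonstrated action, and $f_t$ the language feedback token(s) attached to the step.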

📝 Abstract
Current approaches to embodied AI tend to learn policies from expert demonstrations. However, without a mechanism to evaluate the quality of demonstrated actions, they are limited to learning from optimal behaviour, or they risk replicating errors and inefficiencies. While reinforcement learning offers one alternative, the associated exploration typically results in sacrificing data efficiency. This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations when given access to constructive language feedback as a means to contextualise different modes of behaviour. We directly provide language feedback embeddings as part of the input sequence into a Transformer-based policy, and optionally complement the traditional next action prediction objective with auxiliary self-supervised learning objectives for feedback prediction. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment and show significant improvements in agents' compositional generalisation abilities and robustness, suggesting that our data-efficient method allows models to successfully convert suboptimal behaviour into learning opportunities. Overall, our results suggest that language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks.
Problem

Research questions and friction points this paper is trying to address.

Learning robust policies from both optimal and suboptimal demonstrations
Improving data efficiency in imitation learning with language feedback
Enhancing compositional generalization in vision-and-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses language feedback to contextualize optimal and suboptimal demonstrations
Integrates feedback embeddings into Transformer-based policy input
Adds auxiliary self-supervised objectives for feedback prediction
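The three ideas above can be sketched together. The following is a minimal, purely illustrative NumPy mock-up, not the paper's implementation: random tables stand in for learned embeddings, mean-pooling stands in for the Transformer encoder, and the feedback and action vocabularies are invented for the example. It shows the structural point: feedback embeddings are appended to the policy's input sequence, and a second head predicts the feedback as an auxiliary task.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (hypothetical)

# Hypothetical vocabularies; the paper's actual feedback phrasing is not given here.
feedback_vocab = {"good job": 0, "wrong object": 1, "inefficient path": 2}
action_vocab = {"left": 0, "right": 1, "forward": 2, "pickup": 3}

# Toy embedding tables standing in for learned embeddings.
obs_embed = rng.normal(size=(8, D))                      # 8 observation tokens
feedback_embed = rng.normal(size=(len(feedback_vocab), D))

def build_input_sequence(obs_tokens, feedback):
    """Append the feedback embedding to the observation embeddings,
    mirroring how feedback enters the policy's input sequence."""
    seq = [obs_embed[t] for t in obs_tokens]
    seq.append(feedback_embed[feedback_vocab[feedback]])
    return np.stack(seq)

# Two linear heads share one pooled sequence representation:
W_action = rng.normal(size=(D, len(action_vocab)))       # next-action head
W_feedback = rng.normal(size=(D, len(feedback_vocab)))   # auxiliary feedback head

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(obs_tokens, feedback):
    seq = build_input_sequence(obs_tokens, feedback)
    pooled = seq.mean(axis=0)  # stand-in for Transformer encoding
    return softmax(pooled @ W_action), softmax(pooled @ W_feedback)

action_probs, feedback_probs = forward([0, 3, 5], "wrong object")
```

In training, the action head would be supervised by the demonstrated action and the feedback head by the annotated feedback, so that even suboptimal trajectories contribute a learning signal through the auxiliary loss.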