🤖 AI Summary
Existing Vision-Language-Action (VLA) models lack runtime self-reflection capability, rendering them unable to proactively seek assistance before prediction failure. To address this, we propose INSIGHT—a novel framework that systematically models the temporal evolution of token-level uncertainty (including entropy, log-probability, and Dirichlet-estimated aleatoric/epistemic uncertainty) and employs a lightweight Transformer to make sequence-level “seek-help” decisions. Unlike static uncertainty-thresholding approaches, INSIGHT’s temporal modeling significantly improves help-triggering accuracy. Under strong supervision, it precisely captures uncertainty dynamics; under weak supervision—with distribution alignment—it remains competitive. Evaluated on π₀-FAST, INSIGHT enhances VLA system safety and generalization, enabling real-time error mitigation and establishing a new paradigm for proactive learning.
📝 Abstract
Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present **INSIGHT**, a learning framework that leverages token-level uncertainty signals to predict when a VLA should request help. Using π₀-FAST as the underlying model, we extract per-token *entropy*, *log-probability*, and Dirichlet-based estimates of *aleatoric and epistemic uncertainty*, and train compact transformer classifiers to map these sequences to help triggers. We explore both strong and weak supervision regimes, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation distributions are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
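To make the feature extraction concrete, the sketch below computes the four per-token uncertainty signals named in the abstract: softmax entropy, the log-probability of the emitted token, and a Dirichlet-based split of predictive uncertainty into aleatoric (expected entropy) and epistemic (mutual information) parts. This is a minimal illustration under our own assumptions — the function name `token_features`, the `1e-12` numerical floor, and the hand-rolled `digamma` are hypothetical choices, not taken from the paper's implementation.

```python
import numpy as np

def digamma(x):
    # Digamma via the standard recurrence plus an asymptotic series
    # (assumption: x > 0; accurate to ~1e-8 for this use).
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + np.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def token_features(logits, token_id, alpha):
    """Per-token uncertainty features (hypothetical feature set):
    [entropy, log-prob of emitted token, aleatoric, epistemic]."""
    # Softmax entropy and chosen-token log-probability.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    logprob = np.log(p[token_id] + 1e-12)
    # Dirichlet(alpha) decomposition: total = aleatoric + epistemic.
    a0 = alpha.sum()
    mean = alpha / a0
    total = -np.sum(mean * np.log(mean + 1e-12))          # H(E[p])
    # Expected categorical entropy E[H(p)] under Dirichlet(alpha):
    # psi(a0 + 1) - sum_k (alpha_k / a0) * psi(alpha_k + 1)
    aleatoric = digamma(a0 + 1.0) - np.sum(
        mean * np.array([digamma(a + 1.0) for a in alpha]))
    epistemic = total - aleatoric                         # mutual information
    return np.array([entropy, logprob, aleatoric, epistemic])
```

Stacking these four-dimensional vectors over an action-token sequence yields the input that a compact transformer classifier would consume to emit a sequence-level help trigger.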