🤖 AI Summary
This work addresses the limitation of existing vision-language-action (VLA) models in fine-grained manipulation tasks, where the absence of an effective active visual attention mechanism hinders their ability to focus on task-critical regions. The authors propose a gaze-regularized training framework that requires no architectural modifications or additional inference overhead. By leveraging human gaze heatmaps as supervision, the method converts these into patch-level distributions and uses Kullback–Leibler divergence to regularize the Transformer’s attention maps, thereby guiding the model toward task-relevant features. Compatible with existing datasets and architectures, the approach improves performance by 4–12% across multiple manipulation benchmarks, accelerates convergence, exhibits robustness to lighting variations and sensor noise, and yields interpretable attention visualizations aligned with human strategies.
📝 Abstract
Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns -- offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
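The core regularizer described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper names, the exact patch pooling, the smoothing constant, and the direction of the KL divergence (here KL(gaze ‖ attention)) are all assumptions for clarity.

```python
import numpy as np

def gaze_to_patch_distribution(gaze_heatmap, patch_size):
    """Pool a pixel-level gaze heatmap into a patch-level probability
    distribution (hypothetical helper; the patch grid is assumed to
    tile the image exactly)."""
    H, W = gaze_heatmap.shape
    gh, gw = H // patch_size, W // patch_size
    blocks = gaze_heatmap.reshape(gh, patch_size, gw, patch_size)
    patch_mass = blocks.sum(axis=(1, 3)).reshape(-1)  # one value per patch
    patch_mass = patch_mass + 1e-8  # smooth to avoid zero probabilities
    return patch_mass / patch_mass.sum()

def gaze_kl_loss(attention, gaze_dist, eps=1e-8):
    """KL(gaze || attention): penalizes attention mass placed away
    from patches that humans fixated on."""
    attention = attention + eps
    attention = attention / attention.sum()
    return float(np.sum(gaze_dist * (np.log(gaze_dist) - np.log(attention))))

# Toy example: an 8x8 image split into 4x4 patches -> 4 patches.
heatmap = np.zeros((8, 8))
heatmap[:4, :4] = 1.0  # gaze concentrated on the top-left patch
gaze = gaze_to_patch_distribution(heatmap, patch_size=4)

aligned = np.array([0.85, 0.05, 0.05, 0.05])  # attention near the gaze
uniform = np.full(4, 0.25)                    # unfocused attention

# The regularizer rewards attention maps that match human gaze.
assert gaze_kl_loss(aligned, gaze) < gaze_kl_loss(uniform, gaze)
```

In training, this loss term would be added (with some weight) to the usual action-prediction objective, so the attention maps of selected transformer layers are nudged toward the gaze distribution without changing the architecture or the inference path.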