Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited performance of egocentric vision models on fine-grained future event prediction and current activity understanding. We propose an attention regularization method that incorporates human eye-tracking signals exclusively during training, without altering the model architecture or inference procedure, thereby enabling gaze-guided joint visual-linguistic representation learning by aligning the vision-language model's (VLM) attention distributions with ground-truth fixation regions. The approach is modular and architecture-agnostic. Extensive evaluation across multiple benchmarks demonstrates substantial improvements: semantic scores for future event prediction increase by up to 11.0 points, while current activity recognition accuracy improves by approximately 7.0 percentage points, enhancing both predictive accuracy and cross-scenario robustness. To our knowledge, this is the first work to systematically integrate gaze-based regularization into the VLM training paradigm, offering a scalable, minimally invasive pathway for perceptual alignment in egocentric understanding.
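The summary does not spell out the exact form of the regularizer, but "aligning attention distributions with ground-truth fixation regions" suggests a divergence-style auxiliary loss between two distributions over image patches. The sketch below is a minimal illustration under that reading; the function name, tensor shapes, and choice of KL divergence are assumptions for illustration, not the paper's published loss.

```python
import torch
import torch.nn.functional as F

def gaze_attention_loss(attn_weights: torch.Tensor,
                        gaze_heatmap: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between the model's attention over image patches
    and a ground-truth gaze fixation heatmap.

    attn_weights: (B, P) attention mass on each of P visual patches,
                  e.g. averaged over heads and text query tokens.
    gaze_heatmap: (B, P) fixation density rasterized to the patch grid.
    """
    # Normalize both maps into probability distributions over patches.
    attn = attn_weights / (attn_weights.sum(-1, keepdim=True) + eps)
    gaze = gaze_heatmap / (gaze_heatmap.sum(-1, keepdim=True) + eps)
    # KL(gaze || attn) penalizes attention mass missing from fixated regions.
    return F.kl_div((attn + eps).log(), gaze, reduction="batchmean")
```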

📝 Abstract
Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11 points for future event prediction and by around 7 points for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios like assistive robots and human-machine collaboration. Code and additional information are available at: https://github.com/anupampani/Gaze-VLM
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs for egocentric behavior understanding tasks
Aligning model attention with human gaze during training
Improving prediction accuracy for future events and current activities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses gaze only during the training phase; inference is unchanged (see the training-step sketch after this list)
Aligns model attention with human gaze fixations
Modular design that generalizes across attention-based VLM architectures
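To make the training-only usage concrete, here is a hypothetical training step reusing `gaze_attention_loss` from the sketch above. The `model` and `batch` interfaces, the HF-style `output_attentions` / `cross_attentions` fields, and the `LAMBDA_GAZE` weight are illustrative placeholders, not the repository's actual API; the point is that gaze enters only through the auxiliary loss and is absent at inference.

```python
import torch

# Assumed weighting of the auxiliary loss; the paper's value is not given here.
LAMBDA_GAZE = 0.1

def training_step(model, batch):
    # Forward pass through an HF-style VLM that exposes attention maps.
    out = model(batch["image"], batch["text"], output_attentions=True)
    task_loss = out.loss  # standard next-token / prediction objective

    # Collapse layers, heads, and query tokens into per-patch attention mass.
    # Assumes cross_attentions stacks to (layers, B, heads, queries, patches).
    attn = torch.stack(out.cross_attentions).mean(dim=(0, 2, 3))

    reg = gaze_attention_loss(attn, batch["gaze_heatmap"])
    return task_loss + LAMBDA_GAZE * reg

@torch.no_grad()
def predict(model, image, prompt):
    # Inference is unchanged: no gaze heatmap is required or consumed.
    return model.generate(image, prompt)
```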