Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address semantic and attentional ambiguities in gaze-language interaction for smart glasses, where spoken queries suffer from referential uncertainty, gaze data is noisy, and gaze-language coupling exhibits complex spatiotemporal dynamics, we propose GLARIFY. The method comprises: (1) a dynamic gaze heatmap module that models how attention evolves over space and time; (2) a chain-of-thought mechanism for robustly decoding noisy gaze sequences; and (3) GLARIFY-Ambi, a synthetic dataset built with GPT-4o that covers diverse ambiguous gaze patterns. GLARIFY integrates heatmap-based attention into vision-language models without disrupting their pretrained knowledge, significantly improving multimodal query accuracy and interaction naturalness. Experiments demonstrate substantial gains over state-of-the-art baselines in real-world settings, establishing a new paradigm for interpretable and robust attention modeling in gaze-driven visual assistants.
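As a rough illustration of what a dynamic gaze heatmap module might compute, the sketch below rasterizes a sequence of timestamped fixations into a single attention map, using a Gaussian spatial footprint and an exponential temporal decay so that recent fixations dominate. The kernel width, decay constant, and function name are illustrative assumptions; the paper's exact formulation is not given on this page.

```python
import numpy as np

def dynamic_gaze_heatmap(fixations, shape=(224, 224), sigma=12.0, tau=1.5):
    """Accumulate (t, x, y) fixations into one map, weighting recent ones more.

    x and y are normalized to [0, 1]; sigma is in pixels, tau in seconds.
    All parameter values here are illustrative, not taken from the paper.
    """
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]            # pixel coordinate grids
    t_last = max(t for t, _, _ in fixations)
    heat = np.zeros(shape, dtype=np.float32)
    for t, x, y in fixations:
        w = np.exp(-(t_last - t) / tau)    # older fixations decay away
        d2 = (xs - x * W) ** 2 + (ys - y * H) ** 2
        heat += w * np.exp(-d2 / (2 * sigma ** 2))
    return heat / (heat.max() + 1e-8)      # normalize to [0, 1]

# Example: three fixations drifting toward the upper-right over two seconds.
hm = dynamic_gaze_heatmap([(0.0, 0.3, 0.6), (1.0, 0.5, 0.4), (2.0, 0.7, 0.3)])
```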

📝 Abstract
With the rise in popularity of smart glasses, users' attention has been integrated into Vision-Language Models (VLMs) to streamline multimodal querying in daily scenarios. However, leveraging gaze data to model users' attention may introduce ambiguity challenges: (1) users' verbal questions become ambiguous when they use pronouns or omit context, and (2) human gaze patterns can be noisy and exhibit complex spatiotemporal relationships with the spoken questions. Previous works consider only a single image as the visual input, failing to capture the dynamic nature of the user's attention. In this work, we introduce GLARIFY, a novel method that leverages spatiotemporal gaze information to enhance the model's effectiveness in real-world applications. First, we analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users' gaze patterns. We then utilized GPT-4o to design an automatic data synthesis pipeline that generates the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process for handling noisy gaze patterns. Finally, we designed a heatmap module to incorporate gaze information into cutting-edge VLMs while preserving their pretrained knowledge. Evaluated on a hold-out test set, GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.
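The abstract's key design constraint, injecting gaze information while preserving the VLM's pretrained knowledge, is often realized with a zero-initialized adapter. Below is a minimal, hypothetical sketch in that spirit: the heatmap is pooled onto the vision patch grid, and a zero-initialized projection adds it to the patch embeddings, so the frozen model's behavior is unchanged at initialization. The module name, fusion site, and dimensions are my assumptions, not the paper's published interface.

```python
import torch
import torch.nn as nn

class GazeHeatmapAdapter(nn.Module):
    """Hypothetical adapter: adds a gaze-heatmap bias to visual patch tokens."""

    def __init__(self, d_model: int, grid: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(grid)   # heatmap -> patch grid
        self.proj = nn.Linear(1, d_model)
        nn.init.zeros_(self.proj.weight)         # identity behavior at init,
        nn.init.zeros_(self.proj.bias)           # so pretrained knowledge is intact

    def forward(self, patch_tokens, heatmap):
        # patch_tokens: (B, grid*grid, d_model); heatmap: (B, 1, H, W)
        h = self.pool(heatmap).flatten(2).transpose(1, 2)  # (B, grid*grid, 1)
        return patch_tokens + self.proj(h)

# Usage with placeholder tensors standing in for a VLM's vision features.
adapter = GazeHeatmapAdapter(d_model=768)
tokens = torch.randn(2, 256, 768)
heat = torch.rand(2, 1, 224, 224)
out = adapter(tokens, heat)   # same shape as tokens: (2, 256, 768)
```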
Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguity in gaze-facilitated visual assistant interactions
Handling noisy gaze patterns with complex spatiotemporal relationships
Aligning vision-language models with dynamic human attention data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging spatiotemporal gaze data to reduce ambiguity
Using GPT-4o to synthesize a dataset with chain-of-thought annotations (see the sketch after this list)
Incorporating gaze heatmaps into pretrained vision-language models
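To make the data-synthesis idea concrete, here is a hedged sketch of what one GPT-4o synthesis step could look like: given a scene's objects and a noisy gaze trace, the model is asked to produce an ambiguous query plus a chain-of-thought that filters the noise. The prompt wording, JSON schema, and helper name are illustrative assumptions; the actual GLARIFY-Ambi pipeline is not specified on this page.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def synthesize_sample(scene_objects: str, gaze_trace: str) -> dict:
    """Ask GPT-4o for one (ambiguous query, CoT, answer) training triple."""
    prompt = (
        f"Scene objects: {scene_objects}\n"
        f"Noisy gaze trace (object fixated per timestep): {gaze_trace}\n"
        "Write (1) an ambiguous spoken query that uses a pronoun, and (2) a "
        "short chain-of-thought that filters gaze noise and names the intended "
        'object. Reply as JSON: {"query": ..., "cot": ..., "answer": ...}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Example call: the gaze mostly hits the laptop, with one noisy mug fixation.
sample = synthesize_sample("mug, laptop, poster", "laptop, laptop, mug, laptop")
```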
👥 Authors
Zeyu Wang
Key Laboratory of Pervasive Computing, Tsinghua University
Baiyu Chen
The University of New South Wales
Kun Yan
Beihang University & Microsoft Research
Natural Language Processing, Computer Vision
Hongjing Piao
Key Laboratory of Pervasive Computing, Tsinghua University
Hao Xue
University of New South Wales
human mobility, spatio-temporal data mining
Flora D. Salim
The University of New South Wales
Yuanchun Shi
Professor
human computer interaction
Yuntao Wang
Tsinghua University
Human-Computer Interaction, Ubiquitous Computing, Physio-Behavioral Computing