🤖 AI Summary
This work addresses the challenge of real-time reading-behavior recognition for always-on smart glasses by formally defining the task of first-person, video-based reading detection in unconstrained, “in-the-wild” settings. We introduce the first large-scale, in-the-wild multimodal reading dataset (100 hours), comprising synchronized eye gaze, head pose, and RGB video, and show that these modalities carry complementary information for robust reading inference. We propose a lightweight, configurable Transformer architecture that supports flexible unimodal or multimodal fusion and integrates self-supervised representation learning. Our method achieves high-accuracy binary classification (reading vs. non-reading) in real-world scenarios and further extends to fine-grained reading-type recognition (e.g., skimming vs. focused reading). Extensive experiments demonstrate strong generalization and practical deployability. This work establishes a new benchmark and provides an effective, on-device-ready tool for egocentric, context-aware computing.
📝 Abstract
To enable egocentric contextual AI in always-on smart glasses, it is crucial to keep a record of the user's interactions with the world, including when they are reading. In this paper, we introduce the new task of reading recognition: determining when the user is reading. We first present the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos captured in diverse, realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, and head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are both relevant and complementary to the task, and investigate how to encode each modality efficiently and effectively. Additionally, we show the usefulness of this dataset for classifying types of reading, extending prior reading-understanding studies conducted in constrained settings to greater scale, diversity, and realism. Code, models, and data will be made public.
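The abstract describes a transformer that can consume any subset of the three modalities, individually or combined. The sketch below is a minimal, hypothetical illustration of that design pattern (not the authors' implementation): each modality gets its own linear tokenizer into a shared embedding space, present modalities are concatenated into one token sequence, and a single self-attention layer fuses them before pooling to a reading logit. All dimensions, names, and the single-head attention are illustrative assumptions.

```python
# Hypothetical sketch of flexible unimodal/multimodal fusion; random weights,
# illustrative feature dimensions -- NOT the paper's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
D = 32  # shared embedding dimension (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over (T, D) tokens."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

# One linear tokenizer per modality, mapping raw features to D-dim tokens.
encoders = {
    "gaze": rng.normal(size=(2, D)),   # (x, y) gaze point per frame
    "head": rng.normal(size=(6, D)),   # 6-DoF head pose per frame
    "rgb":  rng.normal(size=(64, D)),  # pre-extracted frame feature
}
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
w_out = rng.normal(size=D)

def classify(inputs):
    """Fuse whichever modalities are present; return a reading logit.

    inputs: dict mapping modality name -> (T, feat_dim) array. Any subset
    of modalities may be supplied, mirroring the unimodal/multimodal
    flexibility described in the abstract.
    """
    tokens = np.concatenate(
        [x @ encoders[name] for name, x in inputs.items()], axis=0)
    fused = self_attention(tokens, Wq, Wk, Wv)
    return float(fused.mean(axis=0) @ w_out)  # >0 would mean "reading"

# Works with a single modality...
logit_gaze = classify({"gaze": rng.normal(size=(30, 2))})
# ...or with all three combined, with no change to the model.
logit_all = classify({
    "gaze": rng.normal(size=(30, 2)),
    "head": rng.normal(size=(30, 6)),
    "rgb":  rng.normal(size=(30, 64)),
})
```

The key design point this illustrates is that fusion happens at the token level, so dropping a modality only shortens the token sequence rather than changing the model.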