🤖 AI Summary
This work addresses the frequent interface switching that disrupts reading immersion in everyday scenarios. We propose the first mixed reality (MR) large language model (LLM) assistant designed for real-world use. Our method integrates real-time camera-based OCR, spatialized visual responses, eye-gaze and gesture interaction, and lightweight LLM inference to enable implicit, zero-trigger, spatially anchored on-demand text summarization and question answering in MR environments. Crucially, we introduce a novel implicit LLM interaction paradigm grounded in spatial affordances, moving beyond conventional screen-bound interfaces. Through iterative field deployments and a longitudinal user diary study, we demonstrate that the system significantly improves reading efficiency (reducing average interaction time by 37%) and enhances situational immersion. The results validate always-on availability, zero context switching, and semantic-spatial coupling of responses as the core advantages of MR-LLM synergy.
📝 Abstract
Large Language Models (LLMs) are gaining popularity as reading and summarization aids. However, little is known about their potential benefits when integrated with mixed reality (MR) interfaces to support everyday reading. We developed RealitySummary, an MR reading assistant that seamlessly integrates LLMs with always-on camera access, OCR-based text extraction, and augmented spatial and visual responses in MR interfaces. RealitySummary evolved iteratively across three versions, each shaped by user feedback and reflective analysis: 1) a preliminary user study to understand user perceptions (N=12), 2) an in-the-wild deployment to explore real-world usage (N=11), and 3) a diary study to capture insights from real-world work contexts (N=5). Our findings highlight the unique advantages of combining AI and MR, including an always-on implicit assistant, minimal context switching, and spatial affordances, and demonstrate significant potential for future LLM-MR interfaces beyond traditional screen-based interactions.
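The abstract describes a pipeline of OCR-based text extraction feeding an LLM, with the response anchored next to the source text in the MR scene. A minimal sketch of that data flow is shown below; it is not the paper's implementation. The names (`AnchoredResponse`, `summarize_at`) and the stub LLM are hypothetical stand-ins for the camera/OCR stage and the real inference call.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class AnchoredResponse:
    text: str                          # LLM output to render in the scene
    anchor: Tuple[float, float, float]  # 3D position next to the source text

def summarize_at(ocr_text: str,
                 anchor: Tuple[float, float, float],
                 llm: Callable[[str], str]) -> AnchoredResponse:
    """Run the LLM over OCR-extracted text and couple the reply to the
    spatial location of the physical document (semantic-spatial coupling)."""
    prompt = "Summarize the following passage in two sentences:\n" + ocr_text
    return AnchoredResponse(text=llm(prompt), anchor=anchor)

# Stub standing in for a real LLM inference endpoint.
def stub_llm(prompt: str) -> str:
    return "Summary: " + prompt.splitlines()[-1][:40]

resp = summarize_at(
    "Mixed reality pairs virtual content with the real world.",
    anchor=(0.2, 1.1, 0.5),
    llm=stub_llm,
)
```

In a deployed system, `ocr_text` would come from the always-on camera feed and `llm` would wrap a hosted or on-device model; keeping the LLM a plain callable makes the spatial-anchoring logic testable without either.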