🤖 AI Summary
This work addresses the unnatural, explicit window-management interactions of current XR headsets. We propose a novel “Put-That-There” paradigm that integrates large language models (LLMs) with multimodal XR sensing. Our method jointly leverages semantic-segmentation-driven 3D environment reconstruction, real-time application metadata, speech commands, pointing gestures, and eye-tracking data; an LLM performs goal-directed intent understanding and one-to-many operation mapping to dynamically infer application invocation, window placement, and cross-tool spatial layout relationships, emitting structured JSON control instructions. Contributions include: (1) the first deep integration of LLMs into the spatial-interaction feedback loop, enabling end-to-end reasoning from high-level semantic goals (e.g., “place the email window on the desk directly in front of me”) to physical-space actions; and (2) significant improvements in naturalness, intent consistency, and cross-application coordination efficiency for window management in panoramic workspaces.
📝 Abstract
We revisit Bolt's classic "Put-That-There" concept for modern head-mounted displays by pairing Large Language Models (LLMs) with the XR sensing stack. The agent fuses (i) a semantically segmented 3-D environment, (ii) live application metadata, and (iii) users' verbal, pointing, and head-gaze cues to issue JSON window-placement actions. As a result, users can manage a panoramic workspace through: (1) explicit commands ("Place Google Maps on the coffee table"), (2) deictic speech plus gestures ("Put that there"), or (3) high-level goals ("I need to send a message"). Unlike traditional explicit interfaces, our system supports one-to-many action mappings and goal-centric reasoning, allowing the LLM to dynamically infer relevant applications and layout decisions, including interrelationships across tools. This enables seamless, intent-driven interaction without manual window juggling in immersive XR environments.
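To make the agent's output concrete, the sketch below shows what one JSON window-placement action might look like and how a client could parse it. The field names (`action`, `app`, `anchor`, `relation`) and the helper function are illustrative assumptions for this example, not the authors' actual schema.

```python
import json

# Hypothetical sketch of a structured JSON control instruction of the
# kind the LLM agent emits; the schema here is an assumption, not the
# paper's actual format.
def make_placement_action(app: str, anchor: str, relation: str = "on") -> str:
    """Build a window-placement instruction as a JSON string."""
    action = {
        "action": "place_window",
        "app": app,            # target application, e.g. "Google Maps"
        "anchor": anchor,      # semantic surface from the segmented 3-D scene
        "relation": relation,  # spatial relation of the window to the anchor
    }
    return json.dumps(action)

# Example corresponding to "Place Google Maps on the coffee table":
instruction = json.loads(make_placement_action("Google Maps", "coffee_table"))
print(instruction["app"], "->", instruction["anchor"])
```

A downstream window manager would consume such an instruction and resolve the semantic anchor ("coffee_table") to a physical surface pose before placing the window.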