🤖 AI Summary
In outdoor augmented reality (AR), static 3D content frequently drifts out of alignment with dynamic physical environments, degrading spatial registration and contextual understanding. To address this, the authors propose an in-situ correction system powered by multimodal large language models (MLLMs). The method jointly analyzes the originally authored view and a live camera view to perform visual-semantic reasoning, automatically detecting misalignments and generating geometrically and semantically consistent 3D scene updates. The work is presented as the first application of MLLMs to runtime visual-semantic alignment in outdoor AR, enabling autonomous adaptation to environmental change without manual intervention. Reported evaluation shows improvements in long-term AR content stability, spatial consistency across real-world scenes, and overall user experience.
📝 Abstract
Site-specific outdoor AR experiences are typically authored using static 3D models, but are deployed in physical environments that change over time. As a result, virtual content may become misaligned with its intended real-world referents, degrading user experience and compromising contextual interpretation. We present AdjustAR, a system that supports in-situ correction of AR content in dynamic environments using multimodal large language models (MLLMs). Given a composite image comprising the originally authored view and the current live user view from the same perspective, an MLLM detects contextual misalignments and proposes revised 2D placements for affected AR elements. These corrections are backprojected into 3D space to update the scene at runtime. By leveraging MLLMs for visual-semantic reasoning, this approach enables automated runtime corrections to maintain alignment with the authored intent as real-world target environments evolve.
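The abstract describes a two-stage correction loop: an MLLM compares the authored view with the current live view in a composite image and proposes revised 2D placements, which are then backprojected into 3D to update the scene. The sketch below illustrates that loop under stated assumptions; `mllm.propose_placements`, the anchor identifiers, and the per-anchor depth values are hypothetical placeholders rather than AdjustAR's actual interface, and the backprojection assumes a standard pinhole camera model with known intrinsics and camera pose.

```python
import numpy as np

def backproject(u, v, depth, K, cam_to_world):
    """Lift a 2D pixel (u, v) with a known depth into a 3D world point
    using the pinhole camera model. K is the 3x3 intrinsics matrix and
    cam_to_world a 4x4 camera-to-world transform."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Ray through the pixel in camera coordinates, scaled to the sampled depth.
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    return (cam_to_world @ p_cam)[:3]

def correct_scene(authored_view, live_view, anchor_ids, K, cam_to_world, mllm):
    """One correction pass: send the MLLM a side-by-side composite of the
    authored and live views, collect revised 2D placements for misaligned
    AR elements, and backproject each into 3D."""
    composite = np.concatenate([authored_view, live_view], axis=1)
    # Placeholder MLLM call: assumed to return {anchor_id: (u, v, depth)}
    # for each element it judges misaligned in the composite image.
    proposals = mllm.propose_placements(composite, anchor_ids)
    return {aid: backproject(u, v, d, K, cam_to_world)
            for aid, (u, v, d) in proposals.items()}
```

In this sketch the depth for each proposed 2D placement is assumed to come from the MLLM's response; a deployed system could instead recover depth by ray-casting against scene geometry or a depth map before updating the 3D scene.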