🤖 AI Summary
Malicious virtual content in augmented reality (AR) poses critical security risks, including occlusion of safety-critical information and covert manipulation of user perception, undermining user trust and safety.
Method: This paper presents ViDDAR and VIM-Sense, two complementary detection systems that combine vision-language models (VLMs) with multimodal reasoning modules to identify task-detrimental AR content. Building on these systems, it proposes a perceptually aligned mechanism for automated content quality assessment and a lightweight model adaptation strategy for deployment on resource-constrained AR devices.
Contribution: To our knowledge, this is the first work to jointly leverage multimodal semantic understanding and human perceptual modeling for fine-grained detection of both occlusion-based and manipulation-based attacks. The proposed research agenda aims to establish a scalable, human-aligned, end-to-end framework for trustworthy AR content governance.
📝 Abstract
As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.
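To make the occlusion-based attack surface concrete, below is a minimal illustrative sketch, not the authors' ViDDAR implementation: a purely geometric check that flags virtual overlays covering too much of a safety-critical region. It assumes an upstream perception stage (e.g., a VLM or object detector) has already localized both virtual overlays and safety-critical real-world content as axis-aligned bounding boxes; all names and the threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

def overlap_ratio(virtual: Box, critical: Box) -> float:
    """Fraction of the safety-critical region covered by the virtual overlay."""
    ix1, iy1 = max(virtual.x1, critical.x1), max(virtual.y1, critical.y1)
    ix2, iy2 = min(virtual.x2, critical.x2), min(virtual.y2, critical.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    denom = critical.area()
    return inter / denom if denom > 0 else 0.0

def flag_occlusions(virtual_boxes, critical_boxes, threshold=0.5):
    """Return (overlay, critical_region) pairs where coverage exceeds threshold."""
    return [(v, c)
            for v in virtual_boxes
            for c in critical_boxes
            if overlap_ratio(v, c) > threshold]
```

For example, a virtual advertisement box fully covering a detected stop sign would be flagged, while a small HUD element in the screen corner would not. A real system would of course also need semantic judgments (is this region actually safety-critical? is the occlusion intentional?), which is precisely where the VLM-based reasoning described above comes in.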