🤖 AI Summary
This work addresses the challenge of assessing safety risks when identical human actions are performed with different underlying intentions. To this end, the authors propose an intention-aware safety assurance framework that uses first-person video, fusing visual and linguistic cues, to predict action intent in real time and evaluate safety dynamically. The core contributions are a goal-conditioned safety Q-filter and a joint intention-action prediction agent, which together enable intent-driven safety interventions without model retraining. The Q-filter is trained with the GRPO algorithm and combined with constrained decoding and a multimodal vision-language model for intent recognition and action forecasting. On the ASIMOV-2.0 benchmark, the method achieves higher intervention accuracy than baseline approaches at the exact ground-truth frame, and goal-conditioned constrained decoding improves action safety by over 41 percentage points.
📝 Abstract
As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount: physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety, where identical actions can be safe or dangerous depending on context. We introduce a dataset pairing egocentric frames with goal-conditioned safety annotations, which enables a goal-conditioned safety Q-filter, trained via GRPO, that evaluates actions with respect to inferred intent without retraining. In addition, we propose an intent-action prediction agent that jointly infers goals and predicts future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame than baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.
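To make the idea of goal-conditioned constrained decoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes a scoring function `safety_q(goal, action)` that returns a safety score in [0, 1] and a fixed threshold; the names `constrained_decode`, `toy_safety_q`, and `TOY_SCORES`, the threshold value, and the example goals and actions are all hypothetical. The sketch only illustrates how the same candidate action can be kept under one inferred goal and filtered out under another.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical goal-conditioned safety scores; a trained Q-filter would
# produce these from the inferred goal and the candidate action.
TOY_SCORES = {
    ("clean the floor", "pour bleach into cup"): 0.90,
    ("clean the floor", "pour water into cup"): 0.95,
    ("make a drink", "pour bleach into cup"): 0.05,
    ("make a drink", "pour water into cup"): 0.95,
}


def toy_safety_q(goal: str, action: str) -> float:
    """Stand-in for a learned safety Q-filter (assumption: score in [0, 1])."""
    return TOY_SCORES.get((goal, action), 0.0)


def constrained_decode(
    goal: str,
    candidates: List[Tuple[str, float]],
    safety_q: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the most likely candidate action that is safe under `goal`.

    `candidates` holds (action, log_probability) pairs from an action
    predictor. Returns None when no candidate passes the filter, which a
    caller could treat as a signal to intervene.
    """
    safe = [(a, lp) for a, lp in candidates if safety_q(goal, a) >= threshold]
    if not safe:
        return None
    return max(safe, key=lambda pair: pair[1])[0]


candidates = [("pour bleach into cup", -0.2), ("pour water into cup", -1.1)]
print(constrained_decode("clean the floor", candidates, toy_safety_q))
print(constrained_decode("make a drink", candidates, toy_safety_q))
```

Because the goal is an input to the filter rather than something baked into the predictor, changing the inferred goal changes which actions survive without touching the predictor's weights, which is consistent with the abstract's "without retraining" claim.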