🤖 AI Summary
Current vision-language models (VLMs) exhibit severe limitations in understanding and reasoning about physical–social norms—such as safety, privacy, and cooperation—in embodied settings, hindering real-world deployment. To address this gap, we introduce EgoNormia: the first benchmark for evaluating physical–social norm comprehension in first-person interactive videos. It comprises 1,853 videos and 3,706 dual-task questions (behavior prediction + plausibility explanation), spanning seven fine-grained normative dimensions—including safety, privacy, and politeness. We design a scalable construction pipeline integrating video sampling, automated question generation, and human verification, and propose a RAG-enhanced normative reasoning method. Experiments reveal that state-of-the-art VLMs achieve only 45% accuracy on EgoNormia—far below human performance (92%)—highlighting a critical capability gap; RAG-based augmentation yields significant gains. This work establishes a foundational benchmark and methodology for assessing and improving social compliance in embodied AI systems.
📝 Abstract
Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms, but also consider the trade-offs between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $|\epsilon|$, consisting of 1,853 egocentric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human benchmark of 92%). Our analysis of performance in each dimension highlights significant safety and privacy risks, as well as deficits in collaboration and communication capability, when these models are applied to real-world agents. We additionally show that, through a retrieval-augmented generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.