EgoNormia: Benchmarking Physical Social Norm Understanding

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) exhibit severe limitations in understanding and reasoning about physical-social norms, such as safety, privacy, and cooperation, in embodied settings, which hinders real-world deployment. To address this gap, we introduce EgoNormia, the first benchmark for evaluating physical-social norm comprehension in first-person interactive videos. It comprises 1,853 videos and 3,706 questions in a dual-task format (predicting the normative action and justifying it), spanning seven fine-grained normative dimensions, including safety, privacy, and politeness. We design a scalable construction pipeline that integrates video sampling, automated question generation, filtering, and human verification, and we propose a retrieval-augmented (RAG) normative reasoning method. Experiments show that state-of-the-art VLMs reach only 45% accuracy on EgoNormia, far below human performance (92%), revealing a critical capability gap; retrieval-based augmentation yields significant gains. This work establishes a foundational benchmark and methodology for assessing and improving norm compliance in embodied AI systems.
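
To make the dual-task format concrete, the following is a minimal Python scoring sketch: each video contributes one action-prediction question and one justification question, and the two are scored separately. The schema and field names ("action", "justification", "*_gold") are hypothetical illustrations, not the dataset's actual layout.

    # Hedged sketch: the field names below are invented for illustration only.
    def score_example(model_answer, example):
        """Score one video's question pair: normative-action choice and its justification."""
        action_ok = model_answer["action"] == example["action_gold"]
        justification_ok = model_answer["justification"] == example["justification_gold"]
        return action_ok, justification_ok

    def benchmark_accuracy(model_answers, examples):
        """Aggregate per-task accuracy over all question pairs."""
        action_hits = justification_hits = 0
        for ans, ex in zip(model_answers, examples):
            a_ok, j_ok = score_example(ans, ex)
            action_hits += a_ok
            justification_hits += j_ok
        n = len(examples)
        return action_hits / n, justification_hits / n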

📝 Abstract
Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms, but also consider the trade-off between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $\|\epsilon\|$, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human benchmark of 92%). Our analysis of performance in each dimension highlights significant risks around safety and privacy, as well as a lack of collaboration and communication capability, when such models are applied to real-world agents. We additionally show that, through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.
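
Read as code, the four-stage construction pipeline from the abstract is a compose-and-filter loop: sample clips, draft answers automatically, filter, then keep only human-validated items. The sketch below is self-contained under stated assumptions: the stage functions are injected callables, and every name is a hypothetical stand-in for the authors' actual tooling.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Candidate:
        """One candidate benchmark item (all fields are illustrative)."""
        clip_id: str
        question: str
        options: List[str]
        gold: int

    def build_dataset(
        clips: List[str],
        draft: Callable[[str], Candidate],    # e.g. a VLM prompted to draft answers
        keep: Callable[[Candidate], bool],    # automatic filter (consistency, difficulty)
        verify: Callable[[Candidate], bool],  # human validation pass
    ) -> List[Candidate]:
        """Sample -> generate -> filter -> verify, mirroring the pipeline in the abstract."""
        drafted = [draft(clip) for clip in clips]   # automatic answer generation
        filtered = [q for q in drafted if keep(q)]  # filtering
        return [q for q in filtered if verify(q)]   # human validation
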
Problem

Research questions and friction points this paper is trying to address.

Evaluating normative reasoning in vision-language models.
Addressing the lack of norm understanding in AI systems.
Improving safety, privacy, and collaboration in AI agents.
Innovation

Methods, ideas, or system contributions that make the work stand out.

EgoNormia dataset: 1,853 ego-centric videos, each paired with prediction and justification questions
Scalable pipeline for video sampling, automatic answer generation, filtering, and human validation
Retrieval-based generation method that enhances normative reasoning in VLMs, sketched below
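
One plausible shape for that retrieval-based method: embed the current situation, fetch the most similar EgoNormia examples, and prepend them to the VLM prompt as normative precedents. The sketch below assumes precomputed embeddings and cosine-similarity retrieval; it illustrates the general retrieval-augmented pattern rather than the paper's exact implementation.

    import numpy as np

    def retrieve(query_vec: np.ndarray, bank_vecs: np.ndarray, k: int = 5) -> np.ndarray:
        """Indices of the k stored examples most similar to the query (cosine similarity)."""
        q = query_vec / np.linalg.norm(query_vec)
        b = bank_vecs / np.linalg.norm(bank_vecs, axis=1, keepdims=True)
        return np.argsort(-(b @ q))[:k]

    def build_prompt(situation: str, bank_texts: list, idxs) -> str:
        """Prepend retrieved normative precedents to the reasoning prompt."""
        precedents = "\n".join(f"- {bank_texts[i]}" for i in idxs)
        return (
            "Relevant norms from similar situations:\n" + precedents +
            "\n\nSituation: " + situation +
            "\nWhat is the normative action here, and why?"
        )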