Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

2026-06-09
AI Summary
Existing robotic systems struggle to distinguish pedestrians from static obstacles in social navigation and lack proactive collision-avoidance capabilities. This work proposes SALSA, a novel framework that reveals, for the first time, the inherent social awareness embedded in pretrained vision-language-action (VLA) models. Through a two-stage, annotation-free post-training process, SALSA aligns robot behavior with social norms and temporal safety by leveraging counterfactual human-object scene pairs, bridging intermediate-layer social features to the action head, and incorporating self-generated future-risk supervision. Without requiring any additional labeled data, the method reduces near-collision incidents by 86.4% on the SCAND benchmark and in real-world deployment, while raising social counterfactual accuracy from 53% to 93%.
๐Ÿ“ Abstract
Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.
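As a rough illustration of what "automatically generated future-risk supervision" could look like, the sketch below derives risk labels from a logged trajectory in hindsight: a timestep is marked risky if the robot's clearance to the nearest pedestrian drops below a threshold within the next few steps. This is an assumption made for illustration only; the abstract does not describe SALSA's actual labeling procedure, and the function name `future_risk_labels`, the `horizon` and `threshold` parameters, and the use of per-step minimum clearance are all hypothetical.

```python
from typing import List


def future_risk_labels(
    clearances: List[float], horizon: int, threshold: float
) -> List[int]:
    """Hindsight risk labels from a logged trajectory (illustrative only).

    clearances[t] is the robot's distance (in meters) to the nearest
    pedestrian at step t. Step t is labeled 1 if any clearance in steps
    t+1 .. t+horizon falls below `threshold`, else 0. No human annotation
    is needed: the labels come from the log itself.
    """
    labels = []
    for t in range(len(clearances)):
        # Look only at the future window, not the current step.
        window = clearances[t + 1 : t + 1 + horizon]
        labels.append(1 if any(d < threshold for d in window) else 0)
    return labels


# Hypothetical log: the robot closes in on a pedestrian, then moves away.
log = [3.0, 2.0, 1.2, 0.4, 1.5, 3.0]
labels = future_risk_labels(log, horizon=2, threshold=0.5)
```

In this toy log the labels turn positive two steps before the 0.4 m near-collision, which is the kind of early signal an anticipatory policy would need to be trained on.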
Problem

Research questions and friction points this paper is trying to address.

social navigation
Vision-Language-Action models
collision avoidance
behavioral alignment
safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action Models
Social Navigation
Behavioral Alignment
Temporal Safety
Counterfactual Training