Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

📅 2026-06-09

🤖 AI Summary

Existing robotic systems struggle to distinguish pedestrians from static obstacles in social navigation and lack proactive collision-avoidance capabilities. This work proposes SALSA, a framework that reveals, for the first time, the social awareness already embedded in pretrained vision-language-action (VLA) models. Through a two-stage, annotation-free post-training process, SALSA aligns robot behavior with social norms and temporal safety in three ways: it trains on counterfactual human-object scene pairs, bridges intermediate-layer social features to the action head, and adds self-generated future-risk supervision. Without any additional labeled data, the method reduces near-collision incidents by 86.4% on the SCAND benchmark and in real-world deployment, and raises social counterfactual accuracy from 53% to 93%.

📝 Abstract

Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%. These results show that pretrained VLA policies can be adapted for safer social navigation by teaching them to act on representations they already possess.
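The abstract does not say how the future-risk supervision in stage (2) is generated. As a rough illustration of what annotation-free risk labels could look like, the sketch below marks a timestep as risky when the robot comes within a distance threshold of any pedestrian in the next few steps. This is a stand-in for intuition only, not the paper's procedure: the function name `future_risk_labels`, the `horizon` and `near_dist` parameters, and the use of logged 2D positions are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def future_risk_labels(
    robot_xy: Sequence[Point],
    pedestrian_xy: Sequence[Sequence[Point]],
    horizon: int = 3,
    near_dist: float = 0.5,
) -> List[int]:
    """Hypothetical labeling rule (not from the paper).

    Step t is labeled 1 if, within the next `horizon` steps, the robot is
    closer than `near_dist` to any pedestrian; otherwise it is labeled 0.
    """
    n = len(robot_xy)
    # Minimum robot-pedestrian distance at each step (inf if no pedestrian).
    min_dist = [
        min((math.dist(robot_xy[t], p) for p in pedestrian_xy[t]), default=math.inf)
        for t in range(n)
    ]
    labels = []
    for t in range(n):
        # Look only at strictly future steps, so the label is anticipatory.
        window = min_dist[t + 1 : t + 1 + horizon]
        labels.append(int(any(d < near_dist for d in window)))
    return labels


# Toy trajectory: the robot drives along the x-axis past a standing pedestrian.
robot = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
pedestrians = [[(3.0, 0.2)] for _ in robot]
print(future_risk_labels(robot, pedestrians, horizon=2, near_dist=0.5))
# prints [0, 1, 1, 0, 0]
```

In the toy run, the steps at t=1 and t=2 are flagged because the close pass at t=3 falls inside their two-step look-ahead window. The label therefore fires before the near-collision itself, which is the anticipatory behavior the abstract describes.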

Problem

Research questions and friction points this paper is trying to address.

Social Navigation

Vision-Language-Action Models

Collision Avoidance

Behavioral Alignment

Safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action Models

Social Navigation

Behavioral Alignment