Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models lack the physical grounding needed to judge present or imminent collisions in human-robot collaboration. This work introduces the task of "collision grounding": inferring contact states between a robot and its surrounding scene or nearby humans from visual observations. The authors present TouchSafeBench, a multimodal, physics-grounded benchmark for human-robot co-presence scenarios, built on Habitat 3.0 and combining multi-view RGB-D images, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. Evaluation on 2,940 simulated indoor episodes, across three VLMs and nine visual representations, shows that the best average Macro-F1 stays below 50%, that models do not reliably turn explicit depth into collision evidence, and that robot-scene contact is consistently harder to recognize than human-contact risk.

📝 Abstract

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.
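
The abstract reports results as Macro-F1, the unweighted mean of per-class F1 scores. The sketch below shows how that metric could be computed for a three-way safety-state classifier. The label names (`safe`, `colliding`, `imminent`) and the function `macro_f1` are assumptions made for illustration; the abstract does not specify the benchmark's label set or evaluation code.

```python
from collections import Counter

# Hypothetical label names; the benchmark's actual label set may differ.
LABELS = ("safe", "colliding", "imminent")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


if __name__ == "__main__":
    truth = ["safe", "safe", "colliding", "imminent"]
    preds = ["safe", "colliding", "colliding", "safe"]
    print(f"Macro-F1: {macro_f1(truth, preds):.3f}")
```

Because every class contributes equally, a monitor that always answers `safe` scores poorly on Macro-F1 even when safe frames dominate the data, which is why the metric suits safety monitoring, where the rare contact classes matter most.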

Problem

Research questions and friction points this paper is trying to address.

collision grounding

vision-language models

human-robot collaboration

safety monitoring

physical accountability

Innovation

Methods, ideas, or system contributions that make the work stand out.

collision grounding

vision-language models

human-robot collaboration