I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting failures in language-conditioned robotic manipulation in open-world settings remains challenging, particularly identifying *semantic misalignment*, where the executed action is physically plausible yet semantically inconsistent with the given instruction. Method: We propose I-FailSense, a framework to define, model, and detect semantic misalignment failures. It introduces (i) a dedicated multi-scenario failure detection dataset; (ii) lightweight, plug-and-play FS classification heads that read internal representations at multiple layers of a vision-language model; and (iii) an ensemble arbitration mechanism that enables zero-shot cross-environment transfer. Robust simulation-to-real generalization is achieved via post-training. Results: I-FailSense outperforms both same-scale and larger state-of-the-art models across multiple benchmarks. All code and data are publicly released.

📝 Abstract
Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets targeting semantic misalignment failure detection from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on post-training a base VLM, followed by training lightweight classification heads, called FS blocks, attached to different internal layers of the VLM, whose predictions are aggregated through an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and transfers effectively to other simulation environments and to the real world with zero-shot or minimal post-training. The datasets and models are publicly released on HuggingFace (Webpage: https://clemgris.github.io/I-FailSense/).
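The dataset-construction step described above (deriving semantic misalignment examples from an existing language-conditioned manipulation dataset) can be sketched as follows. This is a minimal illustration assuming an instruction-swap strategy: the episode fields, labels, and function name are hypothetical, not the paper's exact procedure.

```python
import random

random.seed(1)

# Hypothetical episodes from an existing language-conditioned dataset.
episodes = [
    {"video": "ep0.mp4", "instruction": "pick up the red block"},
    {"video": "ep1.mp4", "instruction": "open the drawer"},
    {"video": "ep2.mp4", "instruction": "push the green cup to the left"},
]

def make_misalignment_pairs(episodes):
    """Positive = a video paired with its own instruction; negative = the
    same video paired with another episode's instruction, i.e. an execution
    that is plausible in itself but misaligned with the instruction."""
    data = []
    for i, ep in enumerate(episodes):
        data.append({"video": ep["video"],
                     "instruction": ep["instruction"],
                     "label": "aligned"})
        other = random.choice([e for j, e in enumerate(episodes) if j != i])
        data.append({"video": ep["video"],
                     "instruction": other["instruction"],
                     "label": "misaligned"})
    return data

dataset = make_misalignment_pairs(episodes)
print(len(dataset))  # one aligned and one misaligned sample per episode
```

Swapping instructions between episodes yields hard negatives: the depicted behavior is a valid manipulation, just not the instructed one, which is exactly the semantic misalignment case the paper targets.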
Problem

Research questions and friction points this paper is trying to address.

Detecting semantic misalignment errors in robotic task execution
Improving failure recognition capabilities in vision-language models
Enabling robust robotic deployment through generalized failure detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training base VLM for failure detection
Lightweight FS blocks on internal layers
Ensemble mechanism aggregates predictions
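The FS-blocks-plus-ensembling idea in the bullets above can be sketched end to end. Everything here (the tapped layer indices, pooled-feature size, sigmoid heads, and probability averaging) is a hypothetical reconstruction of the mechanism, not the released implementation.

```python
import math
import random

random.seed(0)

HIDDEN = 8                 # hypothetical pooled feature size per layer
TAPPED_LAYERS = [4, 8, 12] # hypothetical VLM layers to attach FS blocks to

class FSBlock:
    """A lightweight binary head reading one internal VLM layer (sketch)."""
    def __init__(self, hidden):
        self.w = [random.gauss(0, 0.1) for _ in range(hidden)]
        self.b = 0.0

    def predict_proba(self, pooled):
        # Logistic regression over the pooled layer features.
        z = sum(wi * xi for wi, xi in zip(self.w, pooled)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

def arbitrate(features_by_layer, blocks):
    """Ensemble arbitration: average the per-layer head probabilities."""
    probs = [blk.predict_proba(features_by_layer[layer])
             for layer, blk in zip(TAPPED_LAYERS, blocks)]
    return sum(probs) / len(probs)

blocks = [FSBlock(HIDDEN) for _ in TAPPED_LAYERS]
# Stand-in for pooled hidden states extracted from the tapped layers.
features = {layer: [random.gauss(0, 1) for _ in range(HIDDEN)]
            for layer in TAPPED_LAYERS}
p_fail = arbitrate(features, blocks)  # failure probability in (0, 1)
```

Reading several internal layers rather than only the final one lets the heads exploit both low-level perceptual and high-level semantic features, and averaging their outputs is one simple way to realize the arbitration the paper describes.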