Transparent and Coherent Procedural Mistake Detection

📅 2024-12-16
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper addresses procedural mistake detection (PMD) in first-person (egocentric) videos, extending the task with a transparent, interpretable visual self-dialogue framework. Methodologically, it integrates vision-language models (VLMs) with natural language inference (NLI) models to construct a fine-grained, frame-level benchmark; introduces rationale-guided reasoning generation; and designs coherence-aware fine-tuning and decoding strategies that automatically produce reasoning chains to support decision-making. The contributions are threefold: (1) a self-dialogue rationale paradigm for procedural error detection; (2) an NLI-based automatic evaluation framework for rationale coherence, enabling multidimensional, quantifiable assessment; and (3) empirical validation demonstrating improvements in detection accuracy, reasoning coherence, and inference efficiency, while uncovering a fundamental accuracy-coherence trade-off and identifying directions for further optimization.

📝 Abstract
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that while VLMs struggle off-the-shelf, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods, though not without tradeoff. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
Problem

Research questions and friction points this paper is trying to address.

Classify human task execution success via egocentric video
Generate visual self-dialog rationales for transparent decisions
Improve accuracy and coherence of procedural mistake detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating visual self-dialog rationales for PMD
Using vision-and-language models for frame-based PMD
Leveraging NLI for automated coherence metrics
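The NLI-based coherence idea above could be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the `nli_probs` stub, the two metric definitions, and all function names are assumptions, and a real system would substitute a trained NLI model (e.g., one fine-tuned on MNLI) for the stub.

```python
# Hypothetical sketch: scoring the coherence of a generated rationale
# with an NLI model. `nli_probs` is a stub standing in for a real NLI
# model; its word-overlap heuristic exists only to keep the sketch
# self-contained and runnable.

def nli_probs(premise: str, hypothesis: str) -> dict:
    """Stub NLI model returning entailment/neutral/contradiction scores.
    A real implementation would run a trained NLI classifier here."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    overlap = len(p & h) / max(len(h), 1)  # toy proxy for entailment
    return {"entailment": overlap,
            "neutral": 1.0 - overlap,
            "contradiction": 0.0}

def stepwise_coherence(rationale_steps: list) -> float:
    """Illustrative metric 1: mean entailment probability between
    consecutive steps of the self-dialog rationale."""
    if len(rationale_steps) < 2:
        return 1.0
    scores = [nli_probs(a, b)["entailment"]
              for a, b in zip(rationale_steps, rationale_steps[1:])]
    return sum(scores) / len(scores)

def decision_support(rationale_steps: list, decision: str) -> float:
    """Illustrative metric 2: how strongly the concatenated rationale
    entails the final success/mistake decision."""
    premise = " ".join(rationale_steps)
    return nli_probs(premise, decision)["entailment"]
```

Scores like these can then be folded into decoding (e.g., reranking candidate rationales) or fine-tuning, which is the rough shape of the coherence-aware strategies the summary describes.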
Shane Storks
Research Fellow, University of Michigan
artificial intelligence · natural language processing · commonsense reasoning
Itamar Bar-Yossef
University of Michigan, Ann Arbor, Michigan, USA
Yayuan Li
University of Michigan
AR-AI Instructional Agent · Instructional Videos · Video Generation · Vision and Language
Zheyuan Zhang
University of Michigan, Ann Arbor, Michigan, USA
Jason J. Corso
University of Michigan, Ann Arbor, Michigan, USA
Joyce Chai
University of Michigan, Ann Arbor, Michigan, USA