Transparent and Coherent Procedural Mistake Detection

📅 2024-12-16
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper addresses procedural mistake detection (PMD) in first-person (egocentric) videos, extending the task with a transparent, interpretable visual self-dialogue framework. Methodologically, it integrates vision-language models (VLMs) with natural language inference (NLI) models to construct a fine-grained, frame-level benchmark; introduces rationale-guided reasoning generation; and designs coherence-aware fine-tuning and decoding strategies that automatically produce reasoning chains to support decision-making. The contributions are threefold: (1) a self-dialogue rationale paradigm for procedural error detection; (2) an NLI-based automatic evaluation framework for rationale coherence, enabling multidimensional, quantifiable assessment; and (3) empirical validation demonstrating improvements in detection accuracy, reasoning coherence, and inference efficiency, while uncovering a fundamental accuracy-coherence trade-off and identifying directions for further optimization.

📝 Abstract
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that while VLMs struggle off-the-shelf, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods, though not without tradeoff. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
Problem

Research questions and friction points this paper is trying to address.

Classify human task execution success via egocentric video
Generate visual self-dialog rationales for transparent decisions
Improve accuracy and coherence of procedural mistake detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating visual self-dialog rationales for PMD
Using vision-and-language models for frame-based PMD
Leveraging NLI for automated coherence metrics
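The NLI-based coherence idea above could be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the `nli_probs` stub, the two metric definitions, and all function names are assumptions, and a real system would substitute a trained NLI model (e.g., one fine-tuned on MNLI) for the stub.

```python
# Hypothetical sketch: scoring the coherence of a generated rationale
# with an NLI model. `nli_probs` is a stub standing in for a real NLI
# model; its word-overlap heuristic exists only to keep the sketch
# self-contained and runnable.

def nli_probs(premise: str, hypothesis: str) -> dict:
    """Stub NLI model returning entailment/neutral/contradiction scores.
    A real implementation would run a trained NLI classifier here."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    overlap = len(p & h) / max(len(h), 1)  # toy proxy for entailment
    return {"entailment": overlap,
            "neutral": 1.0 - overlap,
            "contradiction": 0.0}

def stepwise_coherence(rationale_steps: list) -> float:
    """Illustrative metric 1: mean entailment probability between
    consecutive steps of the self-dialog rationale."""
    if len(rationale_steps) < 2:
        return 1.0
    scores = [nli_probs(a, b)["entailment"]
              for a, b in zip(rationale_steps, rationale_steps[1:])]
    return sum(scores) / len(scores)

def decision_support(rationale_steps: list, decision: str) -> float:
    """Illustrative metric 2: how strongly the concatenated rationale
    entails the final success/mistake decision."""
    premise = " ".join(rationale_steps)
    return nli_probs(premise, decision)["entailment"]
```

Scores like these can then be folded into decoding (e.g., reranking candidate rationales) or fine-tuning, which is the rough shape of the coherence-aware strategies the summary describes.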
Shane Storks
Research Fellow, University of Michigan
artificial intelligence · natural language processing · commonsense reasoning
Itamar Bar-Yossef
University of Michigan, Ann Arbor, Michigan, USA
Yayuan Li
University of Michigan
AR-AI Instructional Agent · Instructional Videos · Video Generation · Vision and Language
Zheyuan Zhang
University of Michigan, Ann Arbor, Michigan, USA
Jason J. Corso
University of Michigan, Ann Arbor, Michigan, USA
Joyce Chai
University of Michigan, Ann Arbor, Michigan, USA