Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of rare, subtle, and ambiguous action errors in first-person videos by proposing an Understanding-Enhanced Model Collaboration Method (UE-MCM). The approach combines coarse-grained video-level analysis with fine-grained clip-level analysis: a small model branch jointly models the whole video and the local segment, while a large model branch focuses on fine-grained error discrimination, and their predictions are adaptively fused by a lightweight collaboration gate. The small branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large branch uses Qwen3-VL Embedding representations. To handle the long-tailed distribution of mistakes, the classifiers are trained with complementary objectives: reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system is described as balancing detection accuracy on long-tailed mistakes with efficient inference.
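The adaptive fusion of the two branch predictions can be pictured with a small sketch. The report does not specify the gate's inputs or parameterization, so everything here is an assumption made for illustration: the function name `collaboration_gate`, the three gate features, and the weights `w` and bias `b` are hypothetical, and the gate is taken to be a single sigmoid unit that mixes the two mistake probabilities.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def collaboration_gate(p_small: float, p_large: float, w: list, b: float) -> float:
    """Fuse two mistake probabilities with a scalar gate (illustrative only).

    The gate features (both probabilities and their absolute disagreement)
    are a hypothetical choice, not taken from the report.
    """
    features = [p_small, p_large, abs(p_small - p_large)]
    # Gate value in (0, 1): how much to trust the large branch.
    g = sigmoid(sum(wi * fi for wi, fi in zip(w, features)) + b)
    # Convex combination, so the fused score stays between the two inputs.
    return g * p_large + (1.0 - g) * p_small


# With zero weights and zero bias the gate equals 0.5 (a plain average).
fused_equal = collaboration_gate(0.2, 0.8, w=[0.0, 0.0, 0.0], b=0.0)

# A large positive bias pushes the gate toward the large branch.
fused_large = collaboration_gate(0.2, 0.8, w=[0.0, 0.0, 0.0], b=10.0)
```

Because the output is a convex combination, the fused score can never fall outside the range spanned by the two branch predictions; in a trained system the gate parameters would be learned together with the classifiers.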

📝 Abstract

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
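The three long-tailed objectives named above can be sketched in a few lines. The report does not give their exact formulas, so this is a sketch under stated assumptions rather than the authors' implementation: inverse-frequency weights stand in for "reweighted cross-entropy", a pairwise squared-hinge surrogate stands in for "AUC-oriented learning", and a prior-based logit shift stands in for "label-aware adjustment". All function names and the class counts are hypothetical.

```python
import math


def class_weights(counts):
    """Inverse-frequency class weights (assumed reweighting scheme)."""
    total, k = sum(counts), len(counts)
    return [total / (k * c) for c in counts]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def reweighted_ce(logits, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its class."""
    return -weights[label] * log_softmax(logits)[label]


def pairwise_auc_loss(pos_scores, neg_scores, margin=1.0):
    """Squared-hinge AUC surrogate: penalize mistake/non-mistake pairs
    whose score gap is smaller than the margin."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    return sum(max(0.0, margin - (p - n)) ** 2 for p, n in pairs) / len(pairs)


def label_aware_adjust(logits, counts, tau=1.0):
    """Shift each logit by tau * log(class prior) during training
    (an assumed, logit-adjustment-style reading of the term)."""
    total = sum(counts)
    return [z + tau * math.log(c / total) for z, c in zip(logits, counts)]


# Hypothetical class counts: 900 correct executions, 100 mistakes.
counts = [900, 100]
weights = class_weights(counts)

# Loss on a mistake sample (label 1) with uninformative logits.
adjusted = label_aware_adjust([0.0, 0.0], counts)
loss_mistake = reweighted_ce(adjusted, 1, weights)
```

With these counts the rare mistake class receives a weight nine times that of the frequent class, and the prior shift makes an uninformative classifier pay more for missing a mistake, which is the general effect such objectives aim for on long-tailed data.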

Problem

Research questions and friction points this paper is trying to address.

egocentric mistake detection

long-tailed distribution

action error recognition

first-person video analysis

instructional video understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Model Collaboration

Long-Tailed Learning

Egocentric Video Understanding