🤖 AI Summary
This work addresses self-supervised discovery and temporal ordering of key steps in unlabeled procedural videos, where repeated actions, background or redundant frames, and step-order variations make alignment difficult. The authors propose a fused Gromov-Wasserstein (GW) optimal transport framework with a structural prior for computing frame-to-frame mappings between videos. Because optimizing the alignment term alone can collapse all frames into a small cluster of the embedding space, a contrastive regularization term is added that maps different frames to different points, preventing degenerate solutions. Evaluated on the EgoProceL, ProceL, and CrossTask benchmarks, the method outperforms prior approaches, including OPEL, which uses a traditional Kantorovich optimal transport formulation with an optimality prior. The summary also describes this as the first work to explicitly incorporate structural priors into the GW optimization process for procedure learning.
📝 Abstract
We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework that uses a fused Gromov-Wasserstein optimal transport formulation with a structural prior to compute the frame-to-frame mapping between videos. Optimizing exclusively for this temporal alignment term, however, may lead to degenerate solutions in which all frames are mapped to a small cluster in the embedding space, so that every video is associated with only one key step. To address this limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space and thus avoids collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks, where our approach outperforms previous methods, including OPEL, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.
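To make the alignment objective concrete, the following is a minimal sketch of how a generic fused Gromov-Wasserstein cost can be evaluated for a given coupling between the frames of two videos. It is not the authors' implementation: the function name `fused_gw_cost`, the trade-off weight `alpha`, and the choice of normalized temporal distance as the intra-video structure matrix are all assumptions made for illustration, and the paper's structural prior and contrastive regularizer are not modeled here.

```python
import numpy as np


def fused_gw_cost(X, Y, T, alpha=0.5):
    """Evaluate a generic fused GW objective for a coupling T (illustrative only).

    X: (n, d) frame embeddings of video 1; Y: (m, d) frame embeddings of video 2.
    T: (n, m) non-negative coupling (soft frame-to-frame mapping).
    alpha: hypothetical weight between the feature term and the structure term.
    """
    n, m = len(X), len(Y)

    # Feature (Kantorovich-style) term: squared Euclidean cost between embeddings.
    M = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    feature_term = np.sum(M * T)

    # Structure matrices (assumption): normalized temporal distance within each video.
    tx = np.arange(n) / max(n - 1, 1)
    ty = np.arange(m) / max(m - 1, 1)
    C1 = np.abs(tx[:, None] - tx[None, :])
    C2 = np.abs(ty[:, None] - ty[None, :])

    # GW term: sum_{i,j,k,l} (C1[i,k] - C2[j,l])^2 * T[i,j] * T[k,l],
    # expanded so that no 4-dimensional tensor is materialized.
    p, q = T.sum(axis=1), T.sum(axis=0)
    structure_term = (
        p @ (C1 ** 2) @ p
        + q @ (C2 ** 2) @ q
        - 2.0 * np.sum((C1 @ T @ C2.T) * T)
    )

    return (1.0 - alpha) * feature_term + alpha * structure_term


# Usage: two identical toy "videos" aligned frame by frame.
n, d = 5, 3
X = np.arange(n * d, dtype=float).reshape(n, d)
T_identity = np.eye(n) / n
cost_identity = fused_gw_cost(X, X, T_identity)
```

In this sketch the feature term plays the role of a standard Kantorovich cost, while the structure term compares pairwise relations within each video, which is what distinguishes a Gromov-Wasserstein formulation from a purely feature-based one. In practice the coupling `T` would be optimized rather than supplied by hand.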