🤖 AI Summary
This paper addresses online learning of multi-level temporal event structure from unlabeled streaming video, aiming for human-like hierarchical and predictive event perception. We propose PARSE, the first framework to integrate structured hierarchical prediction with uncertainty-aware generative learning. PARSE employs multi-scale recursive predictors to jointly model action boundaries and nested containment relations; event boundaries emerge dynamically from attention-based feedback and hierarchical prediction errors, yielding cognitively plausible temporal nesting. The method operates fully online, requires no annotations, and processes streaming video in real time. Evaluated on Breakfast Actions, 50 Salads, and Assembly 101, PARSE achieves state-of-the-art performance among streaming methods on H-GEBD, TED, and hierarchical F1, closely approaching offline methods while significantly improving temporal alignment accuracy and structural consistency.
📝 Abstract
Humans naturally perceive continuous experience as a hierarchy of temporally nested events: fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multi-scale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics, while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated on three benchmarks (Breakfast Actions, 50 Salads, and Assembly 101), PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
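The core mechanism described above, boundaries emerging as transient peaks in prediction error, can be illustrated with a minimal sketch. This is not the paper's implementation: it is a single-level, streaming toy in which each frame's prediction error is compared against a running baseline, and a frame is flagged as a boundary when its error spikes well above recent history. The function name, window size, and z-score threshold are all hypothetical choices for illustration.

```python
import numpy as np

def detect_boundaries(errors, window=5, z_thresh=2.0):
    """Flag frames whose prediction error spikes above a running baseline.

    errors:   1-D array of per-frame prediction errors (hypothetical input;
              in practice these would come from a learned predictor).
    window:   number of preceding frames used as the local baseline.
    z_thresh: how many standard deviations above the baseline counts as
              a transient peak, i.e. a candidate event boundary.
    """
    boundaries = []
    for t in range(window, len(errors)):
        recent = errors[t - window:t]
        mu = recent.mean()
        sigma = recent.std() + 1e-8  # avoid division by zero on flat stretches
        if (errors[t] - mu) / sigma > z_thresh:
            boundaries.append(t)
    return boundaries

# Toy stream: a flat error signal with one sharp spike at frame 20,
# standing in for the surprise at an event transition.
errs = np.concatenate([np.full(20, 0.1), [5.0], np.full(29, 0.1)])
print(detect_boundaries(errs))  # the spike at frame 20 is flagged
```

In the full hierarchical setting, one such detector would run per layer at its own temporal granularity, so that coarse-layer peaks mark routine-level boundaries containing the finer action-level ones.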