End-to-End Action Segmentation Transformer

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing action segmentation methods rely on pre-computed frame features from backbones trained on other tasks, and they lack explicit modeling of action segments. This paper proposes EAST, the first end-to-end framework for frame-level video action segmentation. Methodologically, it (1) introduces a lightweight adapter for efficient backbone fine-tuning; (2) adopts a segmentation-by-detection paradigm that treats action proposals as the basic unit and jointly optimizes proposal generation and frame-wise label prediction; and (3) combines proposal-based data augmentation with a coarse-to-fine scheme, in which proposals predicted on a coarsely downsampled video are propagated to all frames, to model action segment structure explicitly. The framework achieves state-of-the-art performance on four benchmark datasets: GTEA, 50Salads, Breakfast, and Assembly-101. The authors state that the model and source code will be publicly released.

📝 Abstract
Existing approaches to action segmentation use pre-computed frame features extracted by methods which have been trained on tasks that are different from action segmentation. Also, recent approaches typically use deep framewise representations that lack explicit modeling of action segments. To address these shortcomings, we introduce the first end-to-end solution to action segmentation -- End-to-End Action Segmentation Transformer (EAST). Our key contributions include: (1) a simple and efficient adapter design for effective backbone fine-tuning; (2) a segmentation-by-detection framework for leveraging action proposals initially predicted over a coarsely downsampled video toward labeling of all frames; and (3) a new action-proposal based data augmentation for robust training. EAST achieves state-of-the-art performance on standard benchmarks, including GTEA, 50Salads, Breakfast, and Assembly-101. The model and corresponding code will be released.
Problem

Research questions and friction points this paper is trying to address.

No existing method offers an end-to-end solution to action segmentation.
Current approaches lack explicit modeling of action segments.
Pre-computed frame features come from backbones trained on other tasks, so they are not optimized for action segmentation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-End Action Segmentation Transformer (EAST)
Segmentation-by-detection framework for frame labeling
Action-proposal based data augmentation for training
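
The abstract describes segmentation-by-detection only at a high level: action proposals are predicted on a coarsely downsampled video and then used to label all frames. The sketch below illustrates one simple way such a mapping could work, and it is not taken from the paper. The `Proposal` type, the function name, the fixed `stride`, and the rule that the highest-scoring proposal wins on overlapping frames are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Proposal(NamedTuple):
    """Hypothetical action proposal on the coarsely downsampled timeline."""
    start: int    # first coarse step covered (inclusive)
    end: int      # coarse step where the proposal stops (exclusive)
    label: int    # predicted action class
    score: float  # confidence of the proposal


def proposals_to_frame_labels(
    proposals: List[Proposal],
    num_frames: int,
    stride: int,
    background: int = 0,
) -> List[int]:
    """Expand coarse proposals to one label per full-resolution frame.

    Assumption: each coarse step stands for `stride` consecutive frames,
    and where proposals overlap the one with the highest score wins.
    Frames covered by no proposal keep the `background` label.
    """
    labels = [background] * num_frames
    best_score = [float("-inf")] * num_frames
    for p in proposals:
        # Map coarse boundaries back to frame indices, clipped to the video.
        first = max(0, p.start * stride)
        last = min(num_frames, p.end * stride)
        for t in range(first, last):
            if p.score > best_score[t]:
                best_score[t] = p.score
                labels[t] = p.label
    return labels


# Two overlapping proposals on a 12-frame video downsampled by 4.
example = proposals_to_frame_labels(
    [Proposal(0, 2, 1, 0.9), Proposal(1, 3, 2, 0.6)],
    num_frames=12,
    stride=4,
)
print(example)
```

In this toy example the first proposal covers frames 0 to 7 and the second covers frames 4 to 11; the overlap goes to the higher-scoring first proposal. A learned model would typically refine boundaries rather than copy them at stride granularity, so this should be read as a baseline for intuition only.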