🤖 AI Summary
This survey covers three core tasks in sports video event detection: Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES). It clarifies how the three task definitions differ, reviews the datasets and evaluation metrics used for each, and traces the evolution of methods built on CNNs and Transformers. The techniques analyzed include multi-modal approaches that fuse audio and visual features, Transformer-based long-range temporal modeling, self-supervised pretraining and knowledge distillation for model compression, and cross-sport transfer to improve generalization. Key contributions include: (1) a clear delineation of the TAL, AS, and PES task boundaries; (2) a categorized review of sports event datasets and evaluation metrics, with the strengths and limitations of each; (3) an analysis of state-of-the-art methods with attention to accuracy, efficiency, and generalization; and (4) a discussion of open challenges and research directions toward generalized, efficient, and robust sports event detection.
📝 Abstract
Video event detection has become an essential component of sports analytics, enabling automated identification of key moments and enhancing performance analysis, viewer engagement, and broadcast efficiency. Recent advancements in deep learning, particularly Convolutional Neural Networks (CNNs) and Transformers, have significantly improved accuracy and efficiency in Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES). This survey provides a comprehensive overview of these three key tasks, emphasizing their differences, applications, and the evolution of methodological approaches. We thoroughly review and categorize existing datasets and evaluation metrics specifically tailored for sports contexts, highlighting the strengths and limitations of each. Furthermore, we analyze state-of-the-art techniques, including multi-modal approaches that integrate audio and visual information, methods utilizing self-supervised learning and knowledge distillation, and approaches aimed at generalizing across multiple sports. Finally, we discuss critical open challenges and outline promising research directions toward developing more generalized, efficient, and robust event detection frameworks applicable to diverse sports. This survey serves as a foundation for future research on efficient, generalizable, and multi-modal sports event detection.