🤖 AI Summary
Multi-object tracking (MOT) in sports scenes is difficult because of rapid motion, frequent occlusions, and camera motion. Conventional tracking-by-detection methods generalize poorly without extensive tuning, while segmentation-based approaches struggle to model temporal trajectories. This paper proposes McByte, a training-free MOT framework that requires neither per-video tuning nor end-to-end training. McByte uses temporally propagated masks, generated by pre-trained segmentation models (e.g., SAM or Mask R-CNN), as a strong association cue and couples them with YOLO-based detectors within a tracking-by-detection architecture. This design improves robustness to motion blur, occlusion, and camera motion. McByte achieves state-of-the-art or near-state-of-the-art performance on SportsMOT, DanceTrack, SoccerNet-tracking 2022, and MOT17, which indicates that it generalizes across both sports and general pedestrian tracking.
📝 Abstract
Multi-object tracking (MOT) is essential for sports analytics, enabling performance evaluation and tactical insights. However, tracking in sports is challenging due to fast movements, occlusions, and camera shifts. Traditional tracking-by-detection methods require extensive tuning, while segmentation-based approaches struggle with track processing. We propose McByte, a tracking-by-detection framework that integrates a temporally propagated segmentation mask as an association cue to improve robustness without per-video tuning. Unlike many existing methods, McByte does not require training, relying solely on pre-trained models and object detectors commonly used in the community. Evaluated on SportsMOT, DanceTrack, SoccerNet-tracking 2022, and MOT17, McByte demonstrates strong performance across sports and general pedestrian tracking. Our results highlight the benefits of mask propagation for a more adaptable and generalizable MOT approach. Code will be made available at https://github.com/tstanczyk95/McByte.
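To make the idea of a mask as an association cue concrete, the sketch below shows one way a propagated mask could adjust a box-IoU matching cost in a tracking-by-detection loop. It is an illustration only and not McByte's actual rule, which the abstract does not specify: the function names, the `mask_weight` and `max_cost` parameters, the fill-ratio measure, and the greedy matching (used here in place of an optimal assignment solver to keep the example dependency-free) are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Mask = List[List[int]]                   # binary mask, mask[y][x] in {0, 1}


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_fill_ratio(mask: Mask, box: Box) -> float:
    """Fraction of the box area (clamped to the frame) covered by the mask."""
    height, width = len(mask), len(mask[0])
    x1, y1 = max(0, int(box[0])), max(0, int(box[1]))
    x2, y2 = min(width, int(box[2])), min(height, int(box[3]))
    area = (x2 - x1) * (y2 - y1)
    if area <= 0:
        return 0.0
    covered = sum(mask[y][x] for y in range(y1, y2) for x in range(x1, x2))
    return covered / area


def associate(
    track_boxes: List[Box],
    track_masks: List[Mask],   # one propagated mask per track, current frame
    det_boxes: List[Box],
    mask_weight: float = 0.5,  # hypothetical strength of the mask cue
    max_cost: float = 0.9,     # hypothetical gating threshold
) -> List[Tuple[int, int]]:
    """Return (track_index, detection_index) pairs, matched greedily."""
    costs = []
    for t, (t_box, t_mask) in enumerate(zip(track_boxes, track_masks)):
        for d, d_box in enumerate(det_boxes):
            cost = 1.0 - box_iou(t_box, d_box)
            # Assumed rule: lower the cost when the track's propagated mask
            # fills the detection box, so the mask can break IoU ties.
            cost -= mask_weight * mask_fill_ratio(t_mask, d_box)
            costs.append((cost, t, d))

    matches, used_tracks, used_dets = [], set(), set()
    for cost, t, d in sorted(costs):
        if cost > max_cost or t in used_tracks or d in used_dets:
            continue
        matches.append((t, d))
        used_tracks.add(t)
        used_dets.add(d)
    return matches
```

In this sketch the mask term matters most when box IoU alone is ambiguous, for example when two players cross and a detection overlaps both predicted track boxes equally. A practical tracker would typically replace the greedy loop with an optimal assignment solver and would gate the mask cue on mask quality.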