🤖 AI Summary
This study addresses the challenge of real-time, high-accuracy estimation of platform crowd density in urban rail transit. It proposes a novel density estimation algorithm that combines semantic segmentation with linear optimization: it models the depth distribution of passengers from segmentation maps generated by deep learning and uses linear optimization to extract counts. The study compares multiple state-of-the-art models (YOLOv11, RT-DETRv2, APGCC, Crowd-ViT, and DeepLabV3) and incorporates a privacy-preserving video processing mechanism. Evaluated on over 600 hours of real-world CCTV footage from the Washington Metropolitan Area Transit Authority (WMATA), the method achieves centimeter-level spatial granularity and sub-second temporal responsiveness without auxiliary sensors. It reduces mean absolute error by 23.7% over existing benchmarks. The approach delivers a robust, production-ready visual perception foundation for intelligent dispatching, emergency response, and passenger service enhancement.
📝 Abstract
Accurately estimating urban rail platform occupancy can enhance transit agencies' ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real-time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed-Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real-time occupancy estimates. This study investigates that potential by comparing three state-of-the-art computer vision approaches for extracting crowd-related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT-DETRv2, and APGCC; (b) crowd-level classification via a custom-trained Vision Transformer, Crowd-ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear-optimization-based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy-preserving dataset of more than 600 hours of video, created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA), our results demonstrate that computer vision approaches can provide substantive value for crowd estimation. This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real-time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.
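The abstract does not spell out how counts are derived from segmentation maps, so the following is only a minimal sketch of one possible reading of "accounting for image object depth": a passenger far from the camera covers fewer pixels than one nearby, so person pixels are tallied per horizontal image band and each band is divided by its own pixels-per-person factor. The function names, the band layout, and the factor values are all hypothetical. In this reading, the per-band factors would be the unknowns calibrated against labelled frames by a linear program, a step that is not shown here.

```python
from typing import List, Sequence

# A segmentation mask as nested lists: 1 = pixel labelled "person", 0 = background.
Mask = List[List[int]]


def band_pixel_counts(mask: Mask, n_bands: int) -> List[int]:
    """Count person pixels in each of n_bands horizontal bands.

    Assumption: band 0 is the top of the image (far from the camera) and the
    last band is the bottom (near the camera).
    """
    height = len(mask)
    counts = [0] * n_bands
    for row_idx, row in enumerate(mask):
        band = row_idx * n_bands // height  # always < n_bands because row_idx < height
        counts[band] += sum(row)
    return counts


def estimate_count(mask: Mask, pixels_per_person: Sequence[float]) -> float:
    """Depth-aware count: divide each band's person pixels by that band's
    (hypothetical, pre-calibrated) pixels-per-person factor and sum the results."""
    counts = band_pixel_counts(mask, len(pixels_per_person))
    return sum(c / p for c, p in zip(counts, pixels_per_person))


# Toy 4x4 mask: one small, distant person (2 pixels) and one large, nearby person (8 pixels).
toy_mask = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

# Illustrative factors only: a far person covers 2 pixels, a near person covers 8.
print(estimate_count(toy_mask, [2.0, 8.0]))  # 2.0
```

A naive count that divides all 10 person pixels by a single factor would misjudge either the distant or the nearby passenger. The per-band factors are what allow the same mask area to map to different head counts along the platform.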