Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation

📅 2025-08-03

🤖 AI Summary

This study addresses real-time estimation of platform crowding in urban rail transit using only CCTV footage. It compares three families of computer vision approaches: object detection and counting (YOLOv11, RT-DETRv2, APGCC), crowd-level classification with a custom-trained Vision Transformer (Crowd-ViT), and semantic segmentation (DeepLabV3). It also proposes a novel, efficient linear-optimization-based method that extracts passenger counts from the segmentation maps while accounting for object depth, and thus for how passengers disperse along a platform. The approaches are evaluated on a privacy-preserving dataset of more than 600 hours of video created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA). The results indicate that CCTV imagery alone, independent of other agency data sources, can support more precise real-time crowding estimation and, eventually, timely operational responses to platform crowding.

📝 Abstract

Accurately estimating urban rail platform occupancy can enhance transit agencies' ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real-time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed-Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real-time occupancy estimates. The presented study investigates this potential by comparing three state-of-the-art computer vision approaches for extracting crowd-related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT-DETRv2, and APGCC; (b) crowd-level classification via a custom-trained Vision Transformer, Crowd-ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear-optimization-based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy-preserving dataset created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA) that encompasses more than 600 hours of video material, our results demonstrate that computer vision approaches can provide substantive value for crowd estimation. This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real-time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.
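To illustrate why depth matters when turning a segmentation map into a count, here is a minimal Python sketch. It is not the paper's linear-optimization method, whose formulation the abstract does not give; it swaps in a simpler depth-weighted pixel count. The function names, the linear area model, and the `near_area` and `far_area` parameters are all assumptions made for illustration.

```python
from typing import Sequence


def expected_person_area(row: int, height: int,
                         near_area: float, far_area: float) -> float:
    """Assumed model: pixel area covered by one person at a given image row.

    Row 0 is the top of the frame (far from the camera) and row height-1 is
    the bottom (near the camera). The area is interpolated linearly between
    far_area and near_area; a real camera would need a calibrated model.
    """
    if height == 1:
        return near_area
    t = row / (height - 1)
    return far_area + t * (near_area - far_area)


def estimate_count(mask: Sequence[Sequence[int]],
                   near_area: float, far_area: float) -> float:
    """Depth-weighted count from a binary person mask (1 = person pixel).

    Each person pixel contributes 1 / (expected person area at its row), so
    small, distant passengers are not undercounted relative to near ones.
    """
    height = len(mask)
    total = 0.0
    for row, pixels in enumerate(mask):
        person_pixels = sum(pixels)
        total += person_pixels / expected_person_area(
            row, height, near_area, far_area)
    return total


# Toy 3x6 mask: one distant person (2 pixels) and one near person (6 pixels).
toy_mask = [
    [1, 1, 0, 0, 0, 0],  # far row, assumed 2 pixels per person
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # near row, assumed 6 pixels per person
]
depth_aware = estimate_count(toy_mask, near_area=6.0, far_area=2.0)
naive = sum(map(sum, toy_mask)) / 6.0  # ignores depth entirely
```

On this toy mask, dividing all 8 person pixels by the near-camera area alone gives about 1.33 people, while the row-weighted estimate gives 2.0, which matches the two passengers the mask was built to represent.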

Problem

Research questions and friction points this paper is trying to address.

Estimating urban rail platform crowding accurately

Using CCTV footage for real-time occupancy estimates

Comparing computer vision methods for crowd feature extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses YOLOv11, RT-DETRv2, and APGCC for object detection and counting

Employs Crowd-ViT for crowd classification

Applies DeepLabV3 for semantic segmentation, with a linear-optimization step that extracts depth-aware counts from the segmentation maps
