🤖 AI Summary
This work addresses the challenge of detecting fare evasion and fraudulent behaviors in public transportation, specifically targeting tailgating, unauthorized entry, and ticketing anomalies.
Method: We propose a real-time multimodal detection framework integrating visual and audio modalities. A Tensor Fusion Network (TFN) is employed to explicitly model unimodal and bimodal interactions, while ViViT and Audio Spectrogram Transformer (AST) are adopted for video and audio feature extraction, respectively.
Contribution/Results: Our key innovation lies in interpretable modeling of dynamic cross-modal relationships, departing from conventional black-box fusion paradigms. Evaluated on a proprietary dataset, the system achieves 89.5% accuracy, 87.2% precision, and 84.0% recall, with a 7.0% absolute improvement in F1-score and an 8.8% gain in recall over baseline methods. These results suggest the approach can help improve operational fairness and safety in public transit systems.
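The summary reports precision and recall but not the F1-score itself. As a quick sanity check, the implied F1 can be computed as the harmonic mean of the two, assuming both figures refer to the same positive (fraud) class on the same test split; that assumption is not stated in the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported values: 87.2% precision, 84.0% recall.
# Assumption: both are measured for the same class on the same data.
f1 = f1_score(0.872, 0.840)
print(f"Implied F1: {f1:.3f}")  # prints "Implied F1: 0.856"
```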
📝 Abstract
This research introduces a multimodal system designed to detect fraud and fare evasion in public transportation by analyzing closed-circuit television (CCTV) and audio data. The proposed solution uses the Video Vision Transformer (ViViT) model for video feature extraction and the Audio Spectrogram Transformer (AST) for audio analysis. The system implements a Tensor Fusion Network (TFN) architecture that explicitly models unimodal and bimodal interactions through a 2-fold Cartesian product. This fusion technique captures complex cross-modal dynamics between visual behaviors (e.g., tailgating, unauthorized access) and audio cues (e.g., fare transaction sounds). The system was trained and tested on a custom dataset, achieving an accuracy of 89.5%, precision of 87.2%, and recall of 84.0% in detecting fraudulent activities. It significantly outperforms early fusion baselines and exceeds the 75% recall rates typically reported in state-of-the-art transportation fraud detection systems. Our ablation studies demonstrate that the tensor fusion approach provides a 7.0% improvement in the F1 score and an 8.8% boost in recall compared to traditional concatenation methods. The solution supports real-time detection, enabling public transport operators to reduce revenue loss, improve passenger safety, and ensure operational compliance.
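The 2-fold Cartesian product behind tensor fusion can be illustrated with a minimal sketch. Each unimodal embedding is extended with a constant 1 and the outer product of the two extended vectors is taken, so the result holds the video-only terms, the audio-only terms, and every pairwise video-audio product. The function names and toy embedding values below are hypothetical and not taken from the paper; in the described system the inputs would be ViViT and AST features of much higher dimension.

```python
from typing import List


def tensor_fusion(video_emb: List[float], audio_emb: List[float]) -> List[List[float]]:
    """Outer product of [1, z_video] and [1, z_audio].

    Row 0 carries the audio-only terms, column 0 the video-only terms,
    and the remaining entries are the bimodal (pairwise) interactions.
    """
    v = [1.0] + list(video_emb)
    a = [1.0] + list(audio_emb)
    return [[vi * aj for aj in a] for vi in v]


def flatten(matrix: List[List[float]]) -> List[float]:
    """Flatten the fused matrix into a vector for a downstream classifier."""
    return [x for row in matrix for x in row]


# Toy embeddings (hypothetical values, chosen only for illustration).
z_video = [0.5, -1.0, 2.0]
z_audio = [3.0, 0.25]

fused = tensor_fusion(z_video, z_audio)
features = flatten(fused)  # length (3 + 1) * (2 + 1) = 12
```

By contrast, plain concatenation would yield only the 3 + 2 = 5 unimodal values with no explicit cross-modal products. The cost of tensor fusion is that the fused dimension grows multiplicatively with the embedding sizes.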