AI Summary
Near-miss events (NMEs) exhibit extreme sparsity and zero-inflation, undermining the reliability of conventional count models. Method: We propose a Grouped Zero-Inflated Poisson (GZIP) model estimated via the EM algorithm, integrating ADAS warnings with multi-source in-vehicle sensor data to enable interpretable weekly driving risk prediction. The model automatically identifies heterogeneous driver subgroups, incorporates an offset term to account for exposure variation, and enhances contextual awareness through multi-sensor feature fusion. Contribution/Results: Evaluated on naturalistic driving data from 354 commercial drivers, GZIP significantly outperforms baseline models, yielding lower AIC/BIC, superior out-of-sample calibration, and robustness to misspecification of the number of latent groups. This provides a reliable, interpretable modeling framework for dynamic risk pricing and personalized safety interventions.
Abstract
Driving-behavior big data, collected through multi-sensor telematics, reveals how people drive and powers applications such as risk evaluation, insurance pricing, and targeted intervention. Usage-based insurance (UBI) built on these data has become mainstream. Telematics-captured near-miss events (NMEs) provide a timely alternative to claim-based risk, but weekly NMEs are sparse, highly zero-inflated, and behaviorally heterogeneous even after exposure normalization. Analyzing multi-sensor telematics and ADAS warnings, we show that traditional statistical models underfit the data. We address these challenges by proposing a set of zero-inflated Poisson (ZIP) frameworks that learn latent behavior groups and fit offset-based count models via EM to yield calibrated, interpretable weekly risk predictions. Using a naturalistic dataset from a fleet of 354 commercial drivers over a year, during which the drivers completed 287,511 trips and logged 8,142,896 km in total, our results show consistent improvements over baselines and prior telematics models, with lower AIC/BIC values in-sample and better calibration out-of-sample. Sensitivity analyses on the number of clusters in the EM-based grouping show that the gains are robust and interpretable. Practically, this supports context-aware ratemaking on a weekly basis and fairer premiums by recognizing heterogeneous driving styles.
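To make the modeling idea concrete, the following is a minimal sketch, not the authors' implementation, of an EM fit for a grouped ZIP model with an exposure offset. It assumes an intercept-only event rate per latent group (no sensor or ADAS covariates), a single group label per driver, and synthetic data. The function names `zip_logpmf`, `sample_poisson`, and `fit_grouped_zip`, the simulated parameter values, and the exposure units are all illustrative assumptions rather than details taken from the paper.

```python
import math
import random


def zip_logpmf(y, exposure, pi, lam):
    """Log-probability of a weekly count y under a ZIP with Poisson mean lam * exposure."""
    mu = lam * exposure  # same as log(mu) = log(exposure) + log(lam): exposure is the offset
    log_pois = y * math.log(mu) - mu - math.lgamma(y + 1)
    if y == 0:
        return math.log(pi + (1.0 - pi) * math.exp(log_pois))
    return math.log(1.0 - pi) + log_pois


def sample_poisson(rng, mu):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, p = math.exp(-mu), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def fit_grouped_zip(drivers, n_groups=2, n_iter=100):
    """EM for a mixture of ZIP components with one latent group per driver.

    drivers: list of per-driver lists of (count, exposure) weekly records.
    Returns group weights, zero-inflation probabilities, rates, and the
    observed-data log-likelihood evaluated at the start of each iteration.
    """
    n_drivers = len(drivers)
    total_y = sum(y for weeks in drivers for y, _ in weeks)
    total_e = sum(e for weeks in drivers for _, e in weeks)
    # Deterministic start (an assumption): spread rates around the pooled crude rate.
    lam = [(total_y / total_e) * (0.5 + k) for k in range(n_groups)]
    pi = [0.5] * n_groups
    weight = [1.0 / n_groups] * n_groups
    loglik_trace = []
    for _ in range(n_iter):
        resp_sum = [0.0] * n_groups   # expected number of drivers per group
        week_sum = [0.0] * n_groups   # expected number of driver-weeks per group
        zero_sum = [0.0] * n_groups   # expected number of structural zeros
        count_sum = [0.0] * n_groups  # expected events from the Poisson state
        expo_sum = [0.0] * n_groups   # expected exposure in the Poisson state
        loglik = 0.0
        for weeks in drivers:
            # E-step (groups): posterior probability of each group for this driver.
            log_joint = [
                math.log(weight[k]) + sum(zip_logpmf(y, e, pi[k], lam[k]) for y, e in weeks)
                for k in range(n_groups)
            ]
            top = max(log_joint)
            log_marginal = top + math.log(sum(math.exp(v - top) for v in log_joint))
            loglik += log_marginal
            for k in range(n_groups):
                r = math.exp(log_joint[k] - log_marginal)
                resp_sum[k] += r
                week_sum[k] += r * len(weeks)
                for y, e in weeks:
                    # E-step (zeros): probability that an observed zero is structural.
                    z = 0.0
                    if y == 0:
                        z = pi[k] / (pi[k] + (1.0 - pi[k]) * math.exp(-lam[k] * e))
                    zero_sum[k] += r * z
                    count_sum[k] += r * (1.0 - z) * y
                    expo_sum[k] += r * (1.0 - z) * e
        loglik_trace.append(loglik)
        # M-step: closed-form updates, possible because each group has an intercept-only rate.
        for k in range(n_groups):
            weight[k] = resp_sum[k] / n_drivers
            pi[k] = zero_sum[k] / week_sum[k]
            lam[k] = count_sum[k] / expo_sum[k]
    return weight, pi, lam, loglik_trace


# Synthetic fleet (hypothetical values): two driver groups, 40 weeks per driver.
rng = random.Random(0)
true_params = [(0.6, 1.0), (0.2, 4.0)]  # (zero-inflation probability, rate per unit exposure)
drivers = []
for d in range(100):
    pi_true, lam_true = true_params[d % 2]
    weeks = []
    for _ in range(40):
        exposure = rng.uniform(0.5, 2.0)  # weekly exposure in arbitrary distance units
        structural_zero = rng.random() < pi_true
        y = 0 if structural_zero else sample_poisson(rng, lam_true * exposure)
        weeks.append((y, exposure))
    drivers.append(weeks)

weight, pi, lam, loglik_trace = fit_grouped_zip(drivers)
print("group weights:", [round(v, 3) for v in weight])
print("zero-inflation:", [round(v, 3) for v in pi])
print("rates per unit exposure:", [round(v, 3) for v in lam])
```

With covariates such as ADAS warning counts, the closed-form rate update would give way to a weighted Poisson regression per group with log-exposure as the offset; the sketch leaves that out to stay short. Comparing fits across different `n_groups` values by AIC or BIC would mirror the sensitivity analysis the abstract describes.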