🤖 AI Summary
This study addresses root-cause analysis in aviation safety by systematically investigating latent topics, semantic associations, and recurrent patterns within National Transportation Safety Board (NTSB) accident narratives. Methodologically, it conducts the first comparative evaluation of five unsupervised techniques (LDA, pLSA, LSA, NMF, and K-means) on aviation accident texts; among the topic models, LDA achieves the highest coherence score (0.597), establishing it as the best-performing topic modeling approach. A hybrid framework integrating LDA-based topic modeling with K-means clustering enables dual-granularity analysis (“topic–accident”). Results identify high-frequency safety themes, including human factors, weather-related influences, and mechanical failures, and uncover cross-thematic common risk patterns, yielding interpretable, actionable insights for safety interventions. The proposed framework significantly enhances both the precision and traceability of text-driven risk identification in aviation safety analysis.
📝 Abstract
Aviation safety is a global concern, requiring detailed investigations into incidents to understand contributing factors comprehensively. This study applies advanced natural language processing (NLP) techniques to the National Transportation Safety Board (NTSB) dataset, including Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), and K-means clustering. The main objectives are identifying latent themes, exploring semantic relationships, assessing probabilistic connections, and clustering incidents based on shared characteristics. This research contributes to aviation safety by providing insights into incident narratives and demonstrating the versatility of NLP and topic modelling techniques in extracting valuable information from complex datasets. The results, including the topics identified by each technique, provide an understanding of recurring themes. Comparative analysis reveals that LDA performed best with a coherence value of 0.597, followed by pLSA (0.583), LSA (0.542), and NMF (0.437). K-means clustering further reveals commonalities and unique insights into incident narratives. In conclusion, this study uncovers latent patterns and thematic structures within incident narratives, offering a comparative analysis of multiple topic modelling techniques. Future research avenues include exploring temporal patterns, incorporating additional datasets, and developing predictive models for early identification of safety issues. This research lays the groundwork for enhancing the understanding and improvement of aviation safety by utilising the wealth of information embedded in incident narratives.
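To illustrate what a topic coherence score measures, the following is a minimal sketch of a UMass-style coherence computation in plain Python. It is an illustration under stated assumptions, not the study's method: the abstract does not say which coherence measure was used, the function name `umass_coherence` is hypothetical, and the four toy narratives are invented for the example rather than taken from NTSB data. UMass scores are log-ratios and usually negative, so they are not comparable to the reported values such as 0.597.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, eps=1.0):
    """Mean UMass-style coherence of one topic's top words.

    Assumption: this averages log((D(w_i, w_j) + eps) / D(w_j)) over word
    pairs, where D counts the documents containing the given word(s) and
    w_j precedes w_i in the list. The coherence measure actually used in
    the study is not stated in the abstract.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    # combinations() preserves list order, so `earlier` precedes `later`.
    for earlier, later in combinations(top_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # word never occurs in the corpus; skip the pair
        scores.append(math.log((doc_freq(earlier, later) + eps) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical toy narratives, invented for illustration only.
docs = [
    "pilot lost control during landing in gusting crosswind".split(),
    "engine lost power after fuel starvation".split(),
    "pilot failed to maintain control in crosswind landing".split(),
    "fuel contamination caused engine power loss".split(),
]

# Words that tend to share documents versus words drawn from two themes.
coherent_topic = ["pilot", "control", "landing", "crosswind"]
mixed_topic = ["pilot", "fuel", "landing", "engine"]

coherent_score = umass_coherence(coherent_topic, docs)
mixed_score = umass_coherence(mixed_topic, docs)
print(f"coherent topic: {coherent_score:.3f}")
print(f"mixed topic:    {mixed_score:.3f}")
```

A topic whose top words co-occur in the same narratives scores higher than one that mixes unrelated themes, which is the sense in which a coherence value can be used to rank LDA, pLSA, LSA, and NMF against each other.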