A Deterministic Forensic Preprocessing Framework for Heterogeneous Network Datasets: Formal Foundations, Implementation, and Empirical Validation

📅 2026-06-09

🤖 AI Summary

This study addresses a problem in digital forensics: heterogeneous network data arrives with incompatible schemas and timestamp formats, which hinders reliable evidence correlation and timeline reconstruction, and existing preprocessing methods are poorly reproducible. The authors propose a deterministic forensic preprocessing framework that transforms raw data into a standardized, reproducible form through three core operations: schema normalization, temporal normalization, and provenance tracking. The framework formalizes the preprocessing pipeline using set-theoretic constructs and proves its determinism, information preservation, and provenance completeness. It also introduces a bounded-memory, chunk-based streaming architecture for scalable processing. Empirical evaluation on the UNSW-NB15, IoT-23, and TON_IoT datasets demonstrates 100% output consistency across repeated runs and efficient handling of datasets ranging from millions to hundreds of millions of records.

📝 Abstract

Digital forensic investigations increasingly depend on preprocessing heterogeneous network evidence from intrusion detection systems, IoT devices, and enterprise traffic logs. Incompatible schemas and timestamp formats hinder evidence correlation and timeline reconstruction, while current ad hoc approaches offer no mechanism to verify consistency across runs or analyses, creating reproducibility gaps that challenge evidence admissibility. This paper introduces a deterministic forensic preprocessing framework that converts heterogeneous network datasets into a reproducible canonical form. The framework formalises three preprocessing transformations: schema normalisation, temporal normalisation, and provenance tracking. These transformations are specified using set-theoretic definitions and supported by four theorems establishing determinism, information preservation, and provenance completeness. A chunk-based architecture provides O(c) bounded memory. Empirical evaluation across UNSW-NB15, IoT-23, and TON_IoT demonstrates 100% output consistency across repeated runs, robust temporal normalisation completeness over heterogeneous timestamp formats, and scalable performance from millions to hundreds of millions of records.
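
The three transformations and the chunk-based design described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation. The field mappings in `SCHEMA_MAP`, the function names, the `_provenance` key, and the reading of c in O(c) as the chunk size are all assumptions made for illustration, as is the choice to treat timestamps without a timezone as UTC. The sketch hashes the canonical output so that two runs can be compared for consistency.

```python
import hashlib
import json
from datetime import datetime, timezone
from itertools import islice

# Hypothetical per-source field mappings (not taken from the paper).
SCHEMA_MAP = {
    "unsw": {"srcip": "src_ip", "dstip": "dst_ip", "stime": "timestamp"},
    "iot23": {"id.orig_h": "src_ip", "id.resp_h": "dst_ip", "ts": "timestamp"},
}


def normalise_schema(record, source):
    """Rename known fields; unmapped fields are kept so no data is dropped."""
    mapping = SCHEMA_MAP[source]
    return {mapping.get(key, key): value for key, value in record.items()}


def normalise_time(value):
    """Convert epoch seconds or an ISO 8601 string to a UTC ISO 8601 string."""
    try:
        dt = datetime.fromtimestamp(float(value), tz=timezone.utc)
    except (TypeError, ValueError):
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            # Assumption: timestamps without a timezone are already UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        dt = dt.astimezone(timezone.utc)
    return dt.isoformat()


def chunks(iterable, size):
    """Yield lists of at most `size` items, so memory grows with `size` only."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def preprocess(records, source, sink, chunk_size=2):
    """Stream records to `sink` in canonical form; return a SHA-256 digest."""
    digest = hashlib.sha256()
    offset = 0
    for chunk in chunks(records, chunk_size):
        for index, record in enumerate(chunk):
            out = normalise_schema(record, source)
            out["timestamp"] = normalise_time(out["timestamp"])
            # Provenance: where this canonical record came from.
            out["_provenance"] = {"source": source, "row": offset + index}
            # sort_keys gives a stable serialisation independent of key order.
            line = json.dumps(out, sort_keys=True)
            digest.update(line.encode("utf-8") + b"\n")
            sink(line)
        offset += len(chunk)
    return digest.hexdigest()


raw = [
    {"srcip": "10.0.0.1", "dstip": "10.0.0.2", "stime": "1421927414", "sbytes": 132},
    {"srcip": "10.0.0.3", "dstip": "10.0.0.2", "stime": 1421927415, "sbytes": 528},
    {"srcip": "10.0.0.1", "dstip": "10.0.0.4", "stime": "2015-01-22T11:50:16", "sbytes": 64},
]

first_run, second_run = [], []
digest_a = preprocess(raw, "unsw", first_run.append, chunk_size=2)
digest_b = preprocess(raw, "unsw", second_run.append, chunk_size=1)
print(digest_a == digest_b)
```

Because each record is serialised with sorted keys and the row index does not depend on how the input is split, the digest is the same for any chunk size in this sketch. Comparing digests across runs is one simple way to check output consistency; the paper's own verification procedure may differ.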

Problem

Research questions and friction points this paper is trying to address.

digital forensics

heterogeneous network data

evidence reproducibility

schema incompatibility

timestamp normalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

deterministic preprocessing

schema normalisation

temporal normalisation