AI Summary
Existing network traffic analysis tools (e.g., CICFlowMeter) suffer from critical limitations in flow delineation, feature extraction, and label consistency, undermining the reliability and reproducibility of intrusion detection systems (IDS). To address these issues, we propose HERA, a lightweight, open-source, end-to-end traffic processing framework. HERA is the first tool to support configurable feature sets and fine-grained flow labeling, integrating NetFlow/IPFIX parsing, customizable feature engineering, and flexible label mapping. Implemented in Python, it natively supports standard datasets such as UNSW-NB15. Experimental evaluation on UNSW-NB15 demonstrates >99.8% flow generation accuracy and 100% label consistency across all flows. HERA significantly enhances traffic data fidelity, usability, and extensibility, thereby establishing a high-fidelity, reproducible foundation for IDS research and development.
Abstract
Cybersecurity threats highlight the need for robust network intrusion detection systems to identify malicious behaviour. These systems rely heavily on large datasets to train machine learning models capable of detecting patterns and predicting threats. In the past two decades, researchers have produced a multitude of datasets; however, some widely utilised recent datasets generated with CICFlowMeter contain inaccuracies. These inaccuracies cause inconsistencies in flow generation and feature extraction, leading to skewed results and reduced system effectiveness. Other tools in this context lack ease of use, customizable feature sets, and flow labelling options. In this work, we introduce HERA, a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features. Validated and tested with the UNSW-NB15 dataset, HERA demonstrated accurate flow and label generation.
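To make flow labelling and user-defined feature selection more concrete, the following is a minimal sketch of one way such a step could work. It is not HERA's actual implementation: the `Flow` and `GroundTruth` records, the `label_flows` and `select_features` helpers, and the matching rule (identical 5-tuple plus overlapping time window) are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record; field names are illustrative."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    start: float  # flow start time, seconds
    end: float    # flow end time, seconds


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical ground-truth entry describing one labelled event."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    start: float
    end: float
    label: str


def flow_key(record):
    """Return the 5-tuple that identifies a flow or a ground-truth entry."""
    return (record.src_ip, record.dst_ip, record.src_port,
            record.dst_port, record.protocol)


def label_flows(flows, truth, default="Benign"):
    """Pair each flow with a label.

    Assumed rule: a flow takes the label of the first ground-truth entry
    with the same 5-tuple whose time window overlaps the flow's; flows
    with no match receive the default label.
    """
    index = {}
    for entry in truth:
        index.setdefault(flow_key(entry), []).append(entry)

    labelled = []
    for flow in flows:
        label = default
        for entry in index.get(flow_key(flow), []):
            # Two closed intervals overlap when each starts before the other ends.
            if flow.start <= entry.end and entry.start <= flow.end:
                label = entry.label
                break
        labelled.append((flow, label))
    return labelled


def select_features(flow, names):
    """Keep only the user-chosen fields of a flow, in the given order."""
    return {name: getattr(flow, name) for name in names}
```

Under these assumptions, calling `label_flows(flows, truth)` returns `(flow, label)` pairs, and `select_features(flow, ["protocol", "dst_port"])` reduces a flow to the chosen columns. Indexing the ground truth by 5-tuple avoids comparing every flow against every entry.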