When Entropy Is Not Enough: Multi-Modal Classification of Encrypted and Compressed Data Fragments

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the problem of reliably distinguishing encrypted from compressed data in very short fragments (512–2048 bytes), where approaches based on byte statistics or on a single representation often fail. The authors propose Triumvir, an architecture that drops the single-representation assumption by integrating three complementary views of the raw bytes (statistical, sequential, and spatial) and combining them through an uncertainty-aware multi-modal ensemble. Triumvir outperforms state-of-the-art methods by up to 4.5 percentage points on binary classification and up to 6.4 percentage points on multiclass classification. Ablation studies show that combining all modalities yields improvements of up to 5 percentage points over partial configurations, which underscores the value of multi-perspective representations when little information is available.
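To make the three views and the uncertainty-aware combination concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Triumvir's implementation: this summary does not describe the paper's feature extractors, models, or fusion rule. The function names, the 32-byte grid width, and the entropy-based weighting are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def statistical_view(fragment: bytes) -> list[float]:
    """Order-free view: normalized 256-bin byte histogram (non-empty input)."""
    counts = Counter(fragment)
    n = len(fragment)
    return [counts.get(b, 0) / n for b in range(256)]


def sequential_view(fragment: bytes) -> list[int]:
    """Order-preserving view: bytes as integer tokens for a sequence model."""
    return list(fragment)


def spatial_view(fragment: bytes, width: int = 32) -> list[list[int]]:
    """2D view: bytes laid out row-major as a grayscale grid, zero-padded.

    The width of 32 is an assumption made for this sketch.
    """
    padded = fragment + bytes((-len(fragment)) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def entropy_weighted_fusion(per_model_probs: list[list[float]]) -> list[float]:
    """Hypothetical fusion rule: weight each model by its confidence.

    Each model's weight is 1 minus its normalized predictive entropy, so a
    model that outputs a near-uniform distribution contributes very little.
    """
    n_classes = len(per_model_probs[0])
    max_entropy = math.log(n_classes)
    weights = []
    for probs in per_model_probs:
        h = -sum(p * math.log(p) for p in probs if p > 0)
        # The small epsilon keeps the total positive if every model is uniform.
        weights.append(1.0 - h / max_entropy + 1e-9)
    total = sum(weights)
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_model_probs)) / total
        for c in range(n_classes)
    ]
```

In this sketch, each view would feed its own classifier, and `entropy_weighted_fusion` would merge the three probability vectors. A confident model dominates an undecided one, which is one simple way to read "uncertainty-aware"; the paper may use a different mechanism.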

📝 Abstract

Reliable identification of encrypted data fragments is essential in cybersecurity, with applications to ransomware detection, digital forensics, and large-scale data analysis. Distinguishing encrypted from compressed fragments is particularly challenging, as short fragments lack structural data and exhibit low statistical redundancy. Traditional statistical methods based on byte-level distributions show limited effectiveness on this task. Recent machine learning approaches improve performance by learning subtle patterns from raw bytes, but they predominantly rely on single-modal representations, implicitly assuming that a single view of the data is sufficient for accurate classification. This paper shows that this assumption becomes a fundamental limitation in low-information settings, where only small fragments of data (512–2048 bytes) are available. We propose Triumvir, a multi-modal, uncertainty-aware ensemble architecture that integrates statistical, sequential, and spatial representations of raw byte fragments. Extensive experimental analysis demonstrates that Triumvir consistently outperforms state-of-the-art methods, with gains of up to +4.5pp in binary and +6.4pp in multiclass classification. Ablation studies confirm that combining modalities is critical, yielding improvements of up to +5pp over partial configurations.
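The limited effectiveness of byte-level statistics can be illustrated with a small experiment. The Python sketch below computes Shannon entropy for a 1024-byte random fragment and for a 1024-byte slice of a DEFLATE stream. Both setups are assumptions made for this example and are not taken from the paper: `os.urandom` stands in for ciphertext, the synthetic word corpus stands in for real files, and `make_compressed_fragment` is a hypothetical helper.

```python
import math
import os
import random
import string
import zlib
from collections import Counter


def shannon_entropy(fragment: bytes) -> float:
    """Empirical Shannon entropy in bits per byte, between 0 and 8."""
    n = len(fragment)
    counts = Counter(fragment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def make_compressed_fragment(size: int = 1024, seed: int = 0) -> bytes:
    """Return `size` bytes cut from the middle of a zlib (DEFLATE) stream.

    The input is synthetic text built from a small random vocabulary, so it
    compresses well but still yields a stream much longer than `size`.
    """
    rng = random.Random(seed)
    vocabulary = [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 9)))
        for _ in range(64)
    ]
    text = " ".join(rng.choice(vocabulary) for _ in range(50_000)).encode()
    compressed = zlib.compress(text, 9)
    middle = len(compressed) // 2  # skip the header, as a mid-file fragment would
    return compressed[middle:middle + size]


def make_random_fragment(size: int = 1024) -> bytes:
    """Stand-in for ciphertext: bytes from the operating system's CSPRNG."""
    return os.urandom(size)


if __name__ == "__main__":
    print("random    :", round(shannon_entropy(make_random_fragment()), 3))
    print("compressed:", round(shannon_entropy(make_compressed_fragment()), 3))
```

Both fragments should report entropy close to the 8 bits per byte ceiling, so a threshold on this single statistic has little room to separate the two classes at this fragment size. That gap is the motivation for adding views that preserve byte order and layout.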

Problem

Research questions and friction points this paper is trying to address.

encrypted data

compressed data

data fragment classification

low-information setting

multi-modal classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal learning

encrypted data classification

uncertainty-aware ensemble