REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the scarcity of systematic, large-scale empirical data on real-world reverse engineering (RE) practice, a gap that has hindered rigorous analysis of practitioners’ core challenges and knowledge gaps. To bridge this gap, we introduce and publicly release REStack, a dataset of over 12,000 RE-related posts collected over more than 15 years from Stack Overflow and Reverse Engineering Stack Exchange. By combining an LDA topic model whose hyperparameters are tuned with a genetic algorithm, manual topic labeling, and community interaction signals (e.g., unanswered-question rates and response times), we identify 23 semantic topics grouped into six high-level categories. Our analysis shows that debugging, decompilation, and system-level analysis dominate the discussions, whereas memory, firmware, and file format analysis are more difficult and more often left unresolved. REStack provides a reusable resource for developing AI-assisted RE tools, improving education, and enabling reproducible empirical research.
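To make the idea of tuning LDA hyperparameters with a genetic algorithm concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the search bounds, population settings, and the `placeholder_fitness` function are all assumptions made for illustration. In a real pipeline the fitness function would train an LDA model with the candidate settings and return a quality score such as topic coherence.

```python
import random

# Illustrative search bounds (assumptions, not the paper's settings).
BOUNDS = {"num_topics": (5, 50), "alpha": (0.01, 1.0), "beta": (0.01, 1.0)}


def random_individual(rng):
    """Draw one candidate hyperparameter setting uniformly within BOUNDS."""
    return {
        "num_topics": rng.randint(*BOUNDS["num_topics"]),
        "alpha": rng.uniform(*BOUNDS["alpha"]),
        "beta": rng.uniform(*BOUNDS["beta"]),
    }


def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return {key: (a[key] if rng.random() < 0.5 else b[key]) for key in BOUNDS}


def mutate(individual, rng, rate=0.2):
    """Perturb each gene with probability `rate`, clamped to its bounds."""
    child = dict(individual)
    for key, (low, high) in BOUNDS.items():
        if rng.random() < rate:
            step = rng.randint(-3, 3) if key == "num_topics" else rng.gauss(0, 0.1)
            child[key] = min(high, max(low, child[key] + step))
    return child


def genetic_search(fitness, generations=20, pop_size=12, elite=2, seed=0):
    """Maximize `fitness` with elitism, truncation selection, and mutation."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]  # keep the better half as parents
        children = []
        for _ in range(pop_size - elite):
            mother, father = rng.sample(parents, 2)
            children.append(mutate(crossover(mother, father, rng), rng))
        population = ranked[:elite] + children  # elites survive unchanged
    return max(population, key=fitness), history


def placeholder_fitness(individual):
    """Stand-in objective with an arbitrary optimum (hypothetical).

    Replace with a function that trains LDA and returns a coherence score.
    """
    return -(
        abs(individual["num_topics"] - 20)
        + abs(individual["alpha"] - 0.1)
        + abs(individual["beta"] - 0.1)
    )


best, history = genetic_search(placeholder_fitness)
print(best, history[-1])
```

Because the two best candidates are carried over unchanged each generation, the best fitness recorded in `history` never decreases. The fixed seed keeps runs repeatable, which matters when each fitness evaluation is an expensive model fit.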

📝 Abstract

Reverse engineering (RE) is a critical activity in software engineering and cybersecurity, supporting tasks such as malware analysis, vulnerability discovery, legacy system maintenance, and firmware inspection. Despite its importance, there is limited empirical understanding of the challenges, topics, and knowledge gaps faced by RE practitioners in real-world settings, and no publicly available dataset has systematically captured RE discussions from developer Q&A forums. In this paper, we present REStack, a large-scale dataset of RE discussions collected from Stack Overflow and the dedicated Reverse Engineering Stack Exchange site. The dataset comprises over 12,000 RE-related posts spanning more than 15 years. Using Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA)-based hyperparameter optimization, followed by manual topic labeling, we identify 23 semantically coherent RE topics grouped into six high-level thematic categories. The dataset is further enriched with metadata and difficulty indicators derived from community interaction signals, such as unanswered-question rates and response times. Our analysis reveals that RE discussions are predominantly practical and task-oriented, with strong emphasis on debugging, decompilation, and system-level analysis, while topics related to memory, firmware, and file format analysis exhibit high difficulty and high unresolved rates. Beyond empirical characterization, REStack provides a reusable resource for empirical studies, educational research, and the development and evaluation of AI- and LLM-based developer assistance tools for RE. By releasing the dataset and accompanying scripts, this work aims to facilitate reproducible research and advance data-driven support for RE practice.
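The difficulty indicators mentioned in the abstract (unanswered rates and response times) can be illustrated with a small per-topic aggregation. The sketch below is only an assumption about how such signals might be computed: the record fields `topic` and `hours_to_first_answer`, the topic names, and the sample values are hypothetical and are not taken from the REStack schema or its results.

```python
from collections import defaultdict
from statistics import median

# Toy records with invented values; None marks a question with no answer.
posts = [
    {"topic": "topic_a", "hours_to_first_answer": None},
    {"topic": "topic_a", "hours_to_first_answer": 30.0},
    {"topic": "topic_b", "hours_to_first_answer": 1.5},
    {"topic": "topic_b", "hours_to_first_answer": 0.5},
    {"topic": "topic_b", "hours_to_first_answer": None},
    {"topic": "topic_b", "hours_to_first_answer": 4.0},
]


def difficulty_by_topic(records):
    """Return the unanswered rate and median response time for each topic."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["topic"]].append(record["hours_to_first_answer"])

    summary = {}
    for topic, times in grouped.items():
        answered = [t for t in times if t is not None]
        summary[topic] = {
            "questions": len(times),
            "unanswered_rate": (len(times) - len(answered)) / len(times),
            # The median is less sensitive than the mean to a few very slow answers.
            "median_hours_to_answer": median(answered) if answered else None,
        }
    return summary


summary = difficulty_by_topic(posts)
for topic, stats in sorted(summary.items()):
    print(topic, stats)
```

Under this reading, a topic with a higher unanswered rate and a longer median time to first answer would count as harder. Whether the dataset uses the first answer or the accepted answer as the response-time reference is not stated in the abstract.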

Problem

Research questions and friction points this paper is trying to address.

reverse engineering

empirical understanding

developer Q&A forums

knowledge gaps

dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Engineering

Large-scale Dataset

Topic Modeling