Reverse Engineering User Stories from Code using Large Language Models

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the absence or obsolescence of user stories in legacy systems, this paper investigates a reverse-engineering approach for automatically recovering user stories from source code, with a focus on the feasibility of large language models (LLMs) and the critical role of prompt engineering. We conduct a systematic evaluation using five state-of-the-art LLMs (8B–70B parameters) and six prompt strategies on a dataset of 1,750 annotated C++ code snippets. Results demonstrate that even an 8B-parameter model achieves performance comparable to a 70B-parameter model when provided with only a single in-context example—highlighting the substantial gains enabled by effective prompting. Moreover, all models attain an average F1 score of 0.80 on code snippets with ≤200 non-commented logical lines of code (NLOC), confirming both high accuracy and practical applicability for real-world legacy system documentation.

📝 Abstract
User stories are essential in agile development, yet often missing or outdated in legacy and poorly documented systems. We investigate whether large language models (LLMs) can automatically recover user stories directly from source code and how prompt design impacts output quality. Using 1,750 annotated C++ snippets of varying complexity, we evaluate five state-of-the-art LLMs across six prompting strategies. Results show that all models achieve, on average, an F1 score of 0.8 for code up to 200 NLOC. Our findings show that a single illustrative example enables the smallest model (8B) to match the performance of a much larger 70B model. In contrast, structured reasoning via Chain-of-Thought offers only marginal gains, primarily for larger models.
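The abstract reports evaluation via F1 scores between generated and reference user stories. The paper's exact scoring procedure is not described on this page; as an illustration only, a common baseline is token-overlap F1, sketched below (assumed metric, not necessarily the one the authors used):

```python
def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between a generated and a reference user story.

    Precision = shared tokens / predicted tokens,
    Recall    = shared tokens / reference tokens,
    F1        = harmonic mean of the two.
    """
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0

    # Count reference tokens, then consume them as predictions match.
    ref_counts: dict[str, int] = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1

    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

An F1 of 0.8 under a metric like this would mean roughly 80% token agreement (balanced between precision and recall) with the annotated reference story.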
Problem

Research questions and friction points this paper is trying to address.

Recovering missing user stories from legacy source code automatically
Evaluating how prompt design impacts LLM output quality for code analysis
Assessing performance of various LLM sizes on reverse engineering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs reverse engineer user stories from code
Prompt design with examples boosts small model performance
Chain-of-Thought reasoning offers marginal gains for large models
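The key prompting result above is that a single in-context example lets an 8B model match a 70B model. The paper's actual prompt templates are not reproduced on this page; the sketch below is a hypothetical illustration of what a one-shot prompt for this task could look like (all strings and names are assumptions, not the authors' templates):

```python
# Hypothetical worked example for the one-shot prompt; not taken from the paper.
ONE_SHOT_EXAMPLE = """\
Code:
void addToCart(Item item) { cart.push_back(item); total += item.price; }
User story:
As a shopper, I want to add an item to my cart so that its price is included in my total.
"""


def build_one_shot_prompt(cpp_snippet: str) -> str:
    """Assemble a one-shot prompt: task instruction, one worked
    code-to-story example, then the target C++ snippet."""
    return (
        "Recover the user story implemented by the following C++ code.\n\n"
        f"Example:\n{ONE_SHOT_EXAMPLE}\n"
        f"Code:\n{cpp_snippet}\n"
        "User story:"
    )
```

A zero-shot variant would simply omit `ONE_SHOT_EXAMPLE`; the paper's finding is that adding this single example is what closes most of the gap between small and large models.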
Mohamed Ouf
Queen’s University, Kingston, Canada
Haoyu Li
Queen’s University, Kingston, Canada
Michael Zhang
Queen’s University, Kingston, Canada
Mariam Guizani
Queen's University
Software Engineering · HCI · Empirical Studies · Open Source