ArchAgent: Scalable Legacy Software Architecture Recovery with LLMs

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately recovering the architecture of large-scale legacy software systems, which is hindered by architectural drift, missing relationships, and the context length limitations of large language models (LLMs). To overcome these issues, the authors propose a scalable, agent-based framework that integrates static analysis, adaptive code segmentation, and LLM-driven synthesis. The framework employs a context pruning mechanism to generate multi-perspective, business-aligned architectural views and leverages cross-repository dependency analysis to identify critical business modules. Experimental evaluation on representative large GitHub projects demonstrates that the proposed approach significantly outperforms existing baselines. Ablation studies confirm the positive impact of cross-repository context on architectural accuracy, and real-world case studies successfully reconstruct core business logic from legacy systems.

📝 Abstract
Recovering accurate architecture from large-scale legacy software is hindered by architectural drift, missing relations, and the limited context of Large Language Models (LLMs). We present ArchAgent, a scalable agent-based framework that combines static analysis, adaptive code segmentation, and LLM-powered synthesis to reconstruct multiview, business-aligned architectures from cross-repository codebases. ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross-repository data to identify business-critical modules. Evaluations on typical large-scale GitHub projects show significant improvements over existing baselines. An ablation study confirms that dependency context improves the accuracy of the architectures generated for production-level repositories, and a real-world case study demonstrates effective recovery of critical business logic from legacy projects. The dataset is available at https://github.com/panrusheng/arch-eval-benchmark.
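The abstract does not spell out how contextual pruning or business-critical module identification work, so the following is only an illustrative sketch under stated assumptions, not ArchAgent's actual algorithm. It assumes that a module's importance can be approximated by the number of distinct other repositories that depend on it, and that pruning means greedily keeping the highest-ranked modules that fit a token budget. The function names (`rank_modules`, `prune_to_budget`), the edge tuple layout, and the token sizes are all hypothetical.

```python
from collections import defaultdict

def rank_modules(edges):
    """Rank modules by how many distinct *other* repositories depend on them.

    Each edge is (src_repo, src_module, dst_repo, dst_module); this layout is
    an assumption made for the sketch. Same-repository edges are ignored.
    """
    dependents = defaultdict(set)
    for src_repo, _src_module, dst_repo, dst_module in edges:
        if src_repo != dst_repo:
            dependents[(dst_repo, dst_module)].add(src_repo)
    # Most cross-repository dependents first; ties broken by name for determinism.
    return sorted(dependents, key=lambda m: (-len(dependents[m]), m))

def prune_to_budget(ranked, sizes, budget):
    """Greedily keep ranked modules whose combined token size fits the budget."""
    kept, used = [], 0
    for module in ranked:
        cost = sizes.get(module, 0)
        if used + cost <= budget:
            kept.append(module)
            used += cost
    return kept

# Hypothetical dependency edges across three repositories.
edges = [
    ("web", "views", "core", "billing"),
    ("batch", "jobs", "core", "billing"),
    ("web", "views", "core", "auth"),
    ("core", "billing", "core", "auth"),  # same repository, ignored
    ("batch", "jobs", "utils", "log"),
]
# Hypothetical token counts per module.
sizes = {("core", "billing"): 600, ("core", "auth"): 500, ("utils", "log"): 300}

ranked = rank_modules(edges)
context = prune_to_budget(ranked, sizes, budget=1000)
print(context)
```

In this toy input, `core/billing` ranks first because two other repositories depend on it. `core/auth` is then skipped because it would exceed the budget, and the smaller `utils/log` is kept instead. A real system would likely weigh more signals than cross-repository in-degree.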
Problem

Research questions and friction points this paper is trying to address.

legacy software
architecture recovery
architectural drift
Large Language Models
software architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Architecture Recovery
Large Language Models
Agent-based Framework
Cross-repository Analysis
Legacy Software
Rusheng Pan
HiThink Research, Hangzhou, China
Bingcheng Mao
HiThink Research, Hangzhou, China
Tianyi Ma
Hithink RoyalFlush Information Network Co., Ltd.
Machine Learning, Large Language Models, Recommender System
Zhenhua Ling
University of Science and Technology of China, Hefei, China