🤖 AI Summary
This work addresses the challenge of accurately recovering the architecture of large-scale legacy software systems, which is hindered by architectural drift, missing relationships, and the context length limitations of large language models (LLMs). To overcome these issues, the authors propose a scalable, agent-based framework that integrates static analysis, adaptive code segmentation, and LLM-driven synthesis. The framework employs a context pruning mechanism to generate multi-perspective, business-aligned architectural views and leverages cross-repository dependency analysis to identify critical business modules. Experimental evaluation on representative large GitHub projects demonstrates that the proposed approach significantly outperforms existing baselines. Ablation studies confirm the positive impact of cross-repository context on architectural accuracy, and real-world case studies successfully reconstruct core business logic from legacy systems.
📝 Abstract
Recovering accurate architecture from large-scale legacy software is hindered by architectural drift, missing relations, and the limited context of Large Language Models (LLMs). We present ArchAgent, a scalable agent-based framework that combines static analysis, adaptive code segmentation, and LLM-powered synthesis to reconstruct multi-view, business-aligned architectures from cross-repository codebases. ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross-repository data to identify business-critical modules. Evaluations on representative large-scale GitHub projects show significant improvements over existing baselines. An ablation study confirms that dependency context improves the accuracy of generated architectures for production-level repositories, and a real-world case study demonstrates effective recovery of critical business logic from legacy projects. The dataset is available at https://github.com/panrusheng/arch-eval-benchmark.
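The contextual-pruning idea in the abstract can be illustrated with a minimal sketch: given code segments scored by dependency relevance (e.g. from cross-repository analysis), greedily keep the highest-scoring segments that fit within an LLM token budget. All names, the `Segment` type, and the scoring scheme here are illustrative assumptions, not ArchAgent's actual API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    tokens: int       # estimated token count of the segment
    relevance: float  # hypothetical cross-repository dependency score

def prune_context(segments, budget):
    """Greedily select high-relevance segments under a token budget."""
    chosen, used = [], 0
    for seg in sorted(segments, key=lambda s: s.relevance, reverse=True):
        if used + seg.tokens <= budget:
            chosen.append(seg)
            used += seg.tokens
    return chosen

segments = [
    Segment("billing/core.py", tokens=800, relevance=0.9),
    Segment("utils/logging.py", tokens=300, relevance=0.2),
    Segment("orders/service.py", tokens=600, relevance=0.7),
]
kept = prune_context(segments, budget=1500)
print([s.name for s in kept])  # ['billing/core.py', 'orders/service.py']
```

In practice a framework like this would feed only the retained segments to the LLM when synthesizing each architectural view, trading completeness for staying within the model's context window.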