🤖 AI Summary
To address the challenge of cross-repository reuse of neural network modules in open-source PyTorch codebases, this paper introduces NN-RAG, a retrieval-augmented generation (RAG) system that builds a searchable, executable library of validated modules. Methodologically, it combines scope-aware dependency resolution with import-preserving reconstruction, and promotes a block only after it passes a validator, so that every retrieved module is scope-closed, compilable, and runnable. It further integrates multi-level deduplication (exact, lexical, and structural) and optional dataset registration. Applied to 19 major repositories, NN-RAG extracts 1,289 candidate modules, of which 941 (73.0%) pass validation, and it supplies approximately 72% of the novel network structures in the LEMUR dataset. It also supports cross-repository migration of architectural patterns, regenerating a module identified in one project, dependency-complete, in another context.
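As a rough illustration of what dependency resolution with import preservation can look like, the sketch below uses Python's standard `ast` module to pull one class out of a file together with the module-level imports and helpers it references. It is not NN-RAG's implementation: the function name `extract_closed_block` and the sample `SOURCE` are hypothetical, and the name resolution is deliberately simplified (it treats every referenced name as a possible module-level binding and ignores local shadowing).

```python
import ast


def extract_closed_block(source: str, target: str) -> str:
    """Return `target` plus the top-level imports and definitions it depends on."""
    tree = ast.parse(source)

    # Map each name bound at module level to the statement that binds it.
    top_level = {}
    for stmt in tree.body:
        if isinstance(stmt, (ast.Import, ast.ImportFrom)):
            for alias in stmt.names:
                # `import torch.nn as nn` binds `nn`; `import torch.nn` binds `torch`.
                top_level[alias.asname or alias.name.split(".")[0]] = stmt
        elif isinstance(stmt, (ast.ClassDef, ast.FunctionDef)):
            top_level[stmt.name] = stmt

    if target not in top_level:
        raise KeyError(f"{target!r} is not defined at module level")

    # Follow references transitively, starting from the target definition.
    keep, pending = set(), [target]
    while pending:
        stmt = top_level.get(pending.pop())
        if stmt is None or id(stmt) in keep:
            continue  # builtin, local name, or already collected
        keep.add(id(stmt))
        if not isinstance(stmt, (ast.Import, ast.ImportFrom)):
            pending.extend(n.id for n in ast.walk(stmt) if isinstance(n, ast.Name))

    # Re-emit the kept statements in their original order.
    return "\n\n".join(ast.unparse(s) for s in tree.body if id(s) in keep)


# Hypothetical input file: `Block` needs `nn`, `F`, and `swish`, but not `os` or `unused`.
SOURCE = '''
import os
import torch.nn as nn
from torch.nn import functional as F

def swish(x):
    return x * F.sigmoid(x)

def unused(x):
    return os.path.join(x, "a")

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return swish(self.fc(x))
'''

block = extract_closed_block(SOURCE, "Block")
compile(block, "<block>", "exec")  # syntax check only; does not import torch
```

The final `compile` call is only a syntax gate. A validator in the sense described above would also have to execute the block, which this sketch leaves out.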
📝 Abstract
Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion, ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks and validated 941 of them (73.0%), and over 80% of these are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework's neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code. Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
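The three de-duplication levels named in the abstract can be sketched as three increasingly permissive fingerprints. The code below is an assumption about how such levels might be realized, not the paper's method: the exact key hashes the raw text, the lexical key hashes the token stream with comments and indentation width normalized away, and the structural key hashes only the sequence of AST node types, so identifier and constant changes do not matter. All function names are hypothetical.

```python
import ast
import hashlib
import io
import tokenize


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_key(code: str) -> str:
    """Byte-for-byte identity."""
    return _digest(code)


def lexical_key(code: str) -> str:
    """Identity up to comments, blank lines, and indentation width."""
    layout = (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE, tokenize.ENDMARKER)
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            continue
        # Layout tokens keep their kind but not their exact whitespace.
        parts.append(tokenize.tok_name[tok.type] if tok.type in layout else tok.string)
    return _digest(" ".join(parts))


def structural_key(code: str) -> str:
    """Identity up to renaming: only AST node types are compared (a crude proxy)."""
    return _digest(" ".join(type(n).__name__ for n in ast.walk(ast.parse(code))))


def deduplicate(snippets):
    """Keep the first snippet of each group; count drops by the strictest level that matched."""
    keyers = (("exact", exact_key), ("lexical", lexical_key), ("structural", structural_key))
    seen = {level: set() for level, _ in keyers}
    dropped = {level: 0 for level, _ in keyers}
    kept = []
    for code in snippets:
        keys = {level: fn(code) for level, fn in keyers}
        hit = next((level for level, _ in keyers if keys[level] in seen[level]), None)
        if hit is None:
            kept.append(code)
        else:
            dropped[hit] += 1
        for level, key in keys.items():
            seen[level].add(key)
    return kept, dropped


# Hypothetical snippets: an original, a verbatim copy, a re-indented and commented copy,
# a renamed copy with a different constant, and a genuinely different function.
a = "def f(x):\n    return x + 1\n"
b = "def f(x):\n  # add one\n  return x + 1\n"
c = "def g(y):\n    return y + 2\n"
d = "def h(x):\n    return x * x\n"

kept, dropped = deduplicate([a, a, b, c, d])
```

Checking the levels from strictest to loosest makes it possible to report how many candidates were removed at each stage. A node-type sequence is a coarse structural signature and will merge some blocks that a more careful comparison would keep apart.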