AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs

📅 2026-05-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the difficulty large language models (LLMs) have in comprehending large codebases: existing methods build a different, ad hoc view of the repository at query time, and that view varies between runs. The paper proposes AOCI (AI-Oriented Code Indexing), a joint symbolic-semantic index that serves as a structured blueprint of the codebase, so that an LLM can grasp system architecture, dependencies, and key design decisions in a single pass. AOCI generates one index entry per code unit, pairing a symbolic tag with semantic content, and supports automated incremental maintenance. Across 2,160 evaluations, AOCI outperforms all deployable baselines and ranks second only to the Oracle upper bound in overall accuracy. On 19 real-world industrial tasks, it produces zero final-state defects, while mainstream agent-based tools consume 4–130× more tokens (p < 0.001).
📝 Abstract
Large language models struggle to understand codebases beyond a certain scale, such as repositories with hundreds of thousands of lines of code. Existing methods (retrieval, summarization, agent exploration) each construct a different view at query time. The view varies between runs, and what persists is typically ad hoc rather than systematic. This paper introduces AOCI (AI-Oriented Code Indexing), a symbolic-semantic repository representation: a structured blueprint that an LLM can read in a single pass to gain a complete repository-level picture of the system's architecture, dependencies, and key design decisions before starting any task. An AOCI index consists of encoding rules followed by entries, with one entry per code unit (file or database table). Each entry pairs a symbolic tag with semantic content. The symbolic component provides architectural coordinates; the semantic component carries function, dependencies, and constraints. Together they form a consistent, stable representation of the entire system. Index maintenance is incremental: when code changes, only affected entries are regenerated under protocol rules. The AOCI Platform automates this process, keeping the blueprint aligned with the code. We evaluated AOCI on four projects across three LLMs and six context conditions (2,160 evaluations). AOCI outperforms all deployable baselines and ranks second only to the Oracle upper bound in overall accuracy. On 19 industrial tasks across five systems, AOCI produced zero final-state defects, while three mainstream agent-based tools introduced defects in 12 tasks and consumed 4--130$\times$ more tokens ($p < 0.001$). The advantage grows with task complexity.
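The index layout and incremental maintenance described in the abstract can be illustrated with a minimal Python sketch. It is not the AOCI protocol or the AOCI Platform: the names `Entry`, `Index`, `update_index`, and `describe_stub`, the rendered line format, and the use of a content hash to decide which entries are "affected" are all assumptions made for illustration. In particular, the abstract does not say how affected entries are detected, and a real system would presumably call an LLM where this sketch calls a stub.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Entry:
    """One index entry per code unit (a file or a database table)."""
    unit: str                       # file path or table name
    tag: str                        # symbolic component (tag format is hypothetical)
    function: str                   # semantic component: what the unit does
    dependencies: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    source_hash: str = ""           # digest of the source this entry was built from


@dataclass
class Index:
    """Encoding rules followed by entries, as outlined in the abstract."""
    encoding_rules: list[str]
    entries: dict[str, Entry] = field(default_factory=dict)

    def render(self) -> str:
        # Emit the rules first, then one line per code unit in a stable order.
        lines = list(self.encoding_rules)
        for unit in sorted(self.entries):
            e = self.entries[unit]
            deps = ", ".join(e.dependencies) or "-"
            cons = "; ".join(e.constraints) or "-"
            lines.append(
                f"[{e.tag}] {e.unit}: {e.function} | deps: {deps} | constraints: {cons}"
            )
        return "\n".join(lines)


def digest(source: str) -> str:
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def update_index(
    index: Index,
    sources: dict[str, str],
    describe: Callable[[str, str], Entry],
) -> list[str]:
    """Regenerate only entries whose source text changed; return those units.

    Assumption: "affected" means "content hash differs". Entries of units
    that merely depend on a changed unit are not regenerated here.
    """
    regenerated = []
    for unit, source in sources.items():
        h = digest(source)
        old = index.entries.get(unit)
        if old is not None and old.source_hash == h:
            continue  # unchanged unit: keep the existing entry
        entry = describe(unit, source)
        entry.source_hash = h
        index.entries[unit] = entry
        regenerated.append(unit)
    # Drop entries for units that no longer exist.
    for unit in list(index.entries):
        if unit not in sources:
            del index.entries[unit]
    return regenerated


def describe_stub(unit: str, source: str) -> Entry:
    """Hypothetical stand-in for the step that writes an entry."""
    return Entry(
        unit=unit,
        tag=unit.split("/")[0].upper(),
        function=f"{len(source.splitlines())} line(s) of source",
    )
```

Calling `update_index` a second time with unchanged sources returns an empty list, which is the property that keeps maintenance cost proportional to the size of a change rather than to the size of the repository.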
Problem

Research questions and friction points this paper is trying to address.

code understanding
large language models
repository-scale
symbolic-semantic indexing
codebase comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

symbolic-semantic indexing
repository-scale code understanding
AI-oriented code indexing
incremental index maintenance
LLM-based code analysis
Jinshi Liu
Xingyun Zhixue (Beijing) Technology Co., Ltd., Beijing, China
Hanying Zuo
Xingyun Zhixue (Beijing) Technology Co., Ltd., Beijing, China
Congyin Cao
AI Application and Innovation Lab, School of New Media, Peking University, Beijing, China
Anran Zhang
AI Application and Innovation Lab, School of New Media, Peking University, Beijing, China
Yixuan Liu
AMD, Tsinghua University
Generative AI
Xinzhou Xie
AI Application and Innovation Lab, School of New Media, Peking University, Beijing, China