🤖 AI Summary
Existing large language models (LLMs) perform poorly on real-world, repository-level code question answering, which requires cross-file reasoning, multi-document context, and mixed natural-language/code inputs, and the field lacks systematic, realistic benchmarks for evaluating it. Method: We introduce CoReQA, the first benchmark for repository-level QA grounded in authentic GitHub repositories, spanning 176 repositories across four programming languages. We further provide an LLM-as-a-judge framework, a multidimensional automated evaluation that overcomes the semantic and code-relevance limitations of traditional metrics (e.g., BLEU). Contribution/Results: Empirical analysis reveals that state-of-the-art long-context LLMs, including GPT-4 and Claude 3, achieve only about 38% average accuracy on CoReQA, exposing fundamental bottlenecks in cross-file reasoning and structured repository understanding. CoReQA establishes a new standard for evaluating code-aware retrieval, context-efficient modeling, and AI-assisted development at repository scale.
📝 Abstract
Large language models that support software development tasks, such as code generation, code completion, and code question answering (QA), have been extensively studied in both academia and industry, and are integrated into popular intelligent IDEs such as JetBrains and Cursor. Current benchmarks for evaluating models' code comprehension capabilities focus primarily on code generation or completion, often neglecting QA, which is a crucial aspect of understanding code. Existing code QA benchmarks are derived from code comments with predefined patterns (e.g., CodeQA) or focus on specific domains such as education (e.g., CS1QA). These benchmarks fail to capture the real-world complexity of software engineering and users' actual needs when understanding code repositories. To address this gap, we introduce CoReQA, a benchmark for Code Repository-level question answering, constructed from GitHub issues and comments from 176 popular repositories across four programming languages. Because questions and answers may mix natural language and code snippets, traditional evaluation metrics such as BLEU are inadequate for assessing repository-level QA performance. We therefore provide an LLM-as-a-judge framework that evaluates QA performance along five aspects. Based on CoReQA, we evaluate three baselines: two short-context models using generic retrieval strategies and one long-context model that uses the entire repository as context. Evaluation results show that state-of-the-art proprietary and long-context models struggle to answer repository-level questions effectively. Our analysis highlights the limitations of language models in helping developers understand repositories and suggests future directions for improving repository comprehension systems through effective context retrieval methodologies.
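To make the LLM-as-a-judge evaluation concrete, the sketch below scores a candidate answer against the reference answer from the issue thread along five aspects. The aspect names, prompt wording, judge model, and the `judge_answer` helper are illustrative assumptions for this sketch, not the paper's actual rubric or implementation.

```python
# Hypothetical sketch of an LLM-as-a-judge scorer for repository-level QA.
# The five aspects below are placeholders; CoReQA's actual rubric may differ.
import json
from openai import OpenAI

ASPECTS = ["correctness", "completeness", "relevance", "code_accuracy", "clarity"]

JUDGE_PROMPT = """You are evaluating an answer to a repository-level code question.

Question:
{question}

Reference answer (from the GitHub issue thread):
{reference}

Candidate answer:
{candidate}

Rate the candidate on each aspect from 1 (poor) to 5 (excellent):
{aspects}

Respond with a JSON object mapping each aspect to an integer score."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_answer(question: str, reference: str, candidate: str,
                 model: str = "gpt-4o") -> dict[str, int]:
    """Ask a judge model to score a candidate answer on the five aspects."""
    prompt = JUDGE_PROMPT.format(
        question=question,
        reference=reference,
        candidate=candidate,
        aspects="\n".join(f"- {a}" for a in ASPECTS),
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request parseable JSON output
        temperature=0,
    )
    scores = json.loads(response.choices[0].message.content)
    # Keep only the expected aspects; default a missing aspect to the lowest score.
    return {a: int(scores.get(a, 1)) for a in ASPECTS}
```

Per-aspect scores could then be averaged across the benchmark to compare baselines; how CoReQA actually aggregates its five aspects is defined in the paper itself.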