Repository Structure-Aware Training Makes SLMs Better Issue Resolver

📅 2024-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small language models (SLMs) underperform large language models (LLMs) on repository-level code understanding and repair tasks, which limits their use in privacy-sensitive and cost-constrained settings. Method: Repository Structure-Aware Training (ReSAT) constructs training data from large-scale open-source Issue–PR pairs to teach two complementary skills: (i) multi-level progressive localization, which improves code understanding and localization, and (ii) context-based code edit generation. Contribution/Results: On the SWE-Bench-verified and RepoQA benchmarks, ReSAT substantially improves SLMs' ability to resolve repository-level issues and strengthens their long-context comprehension, narrowing the gap with larger models without relying on LLM-scale parameters.

📝 Abstract
Language models have been applied to various software development tasks, but their performance varies with model scale. Large Language Models (LLMs) outperform Small Language Models (SLMs) in complex tasks like repository-level issue resolving, but raise concerns about privacy and cost. In contrast, SLMs are more accessible but underperform in complex tasks. In this paper, we introduce ReSAT (Repository Structure-Aware Training), which constructs training data from a large number of issues and corresponding pull requests in open-source communities to enhance the model's understanding of repository structure and its issue-resolving ability. We construct two types of training data: (1) localization training data, multi-level progressive localization data that improves code understanding and localization capability; and (2) code edit training data, which improves context-based code editing capability. Evaluation results on SWE-Bench-verified and RepoQA demonstrate that ReSAT effectively enhances SLMs' issue-resolving and repository-level long-context understanding capabilities.
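To make the two training-data types concrete, here is a minimal hypothetical sketch of deriving them from a single Issue–PR pair. The paper's actual construction pipeline is not specified here; the `IssuePR` type, the `build_resat_examples` function, and the two-level (file, line) localization split are illustrative assumptions.

```python
from dataclasses import dataclass
import re

@dataclass
class IssuePR:
    issue_text: str
    diff: str  # unified diff taken from the linked pull request

def build_resat_examples(pair: IssuePR):
    """Hypothetical sketch: derive (1) progressive localization targets and
    (2) a context-based code-edit target from one Issue-PR pair."""
    # Changed files, read from the "+++ b/<path>" headers of the diff
    files = re.findall(r"^\+\+\+ b/(\S+)", pair.diff, flags=re.M)
    # Changed locations, read from the "@@ -a,b +c,d @@" hunk headers
    hunks = re.findall(r"^@@ -\d+(?:,\d+)? \+(\d+)", pair.diff, flags=re.M)

    localization = [
        # coarse level: which files must change to resolve the issue
        {"task": "file-localization", "input": pair.issue_text, "target": files},
        # finer level: which line regions within those files
        {"task": "line-localization", "input": pair.issue_text,
         "target": [int(h) for h in hunks]},
    ]
    # Edit example: the full patch serves as the generation target
    edit = {"task": "code-edit", "input": pair.issue_text, "target": pair.diff}
    return localization, edit

demo = IssuePR(
    issue_text="Crash when config file is missing",
    diff=(
        "--- a/app/config.py\n"
        "+++ b/app/config.py\n"
        "@@ -10,2 +10,3 @@\n"
        " def load(path):\n"
        "+    if not os.path.exists(path): return {}\n"
        "     return json.load(open(path))\n"
    ),
)
loc, edit = build_resat_examples(demo)
print(loc[0]["target"])   # ['app/config.py']
print(loc[1]["target"])   # [10]
```

In this reading, "multi-level progressive" means the model is trained to narrow down from files to lines before generating the edit; real pipelines would also need repository context (file tree, retrieved code) in each `input`.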
Problem

Research questions and friction points this paper is trying to address.

Small Software Language Models
Complex Programming Problems
Performance Gap with Large Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReSAT
Small Language Models (SLMs)
Code Repository Structure Understanding