Towards Effectively Leveraging Execution Traces for Program Repair with Code LLMs

📅 2025-05-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current automated program repair (APR) methods over-rely on static analysis while neglecting dynamic runtime behavior, limiting their ability to guide large language models (LLMs) toward accurate fixes. Method: This paper presents the first systematic investigation into how program execution traces can enhance LLM-based repair. We propose a trace-injection prompting strategy that structurally incorporates dynamic execution information into LLM inputs while keeping computational overhead manageable. Contribution/Results: Extensive evaluation across six dataset–model combinations, including controlled ablation studies and probing analyses, delineates the efficacy boundary of execution traces: naive trace injection improves accuracy in only two of six configurations, while LLM-optimized trace prompts outperform trace-free baselines more consistently and surpass lightweight fine-tuning of a smaller model. Our core contribution is establishing execution traces as an effective complementary signal for LLM-based program understanding, enabling a scalable, dynamically aware APR paradigm.

📝 Abstract
Large Language Models (LLMs) show promising performance on various programming tasks, including Automatic Program Repair (APR). However, most approaches to LLM-based APR are limited to static analysis of the programs, disregarding their runtime behavior. Inspired by knowledge-augmented NLP, in this work we aim to remedy this potential blind spot by augmenting standard APR prompts with program execution traces. We evaluate our approach using the GPT family of models on three popular APR datasets. Our findings suggest that simply incorporating execution traces into the prompt provides only a limited performance improvement over trace-free baselines, helping in just 2 of the 6 tested dataset/model configurations. We further find that the effectiveness of execution traces for APR diminishes as their complexity increases. We explore several strategies for leveraging traces in prompts and demonstrate that LLM-optimized prompts outperform trace-free prompts more consistently. Additionally, we show trace-based prompting to be superior to fine-tuning a smaller LLM on a small-scale dataset, and conduct probing studies reinforcing the notion that execution traces can complement the reasoning abilities of the LLMs.
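The core idea of augmenting an APR prompt with an execution trace can be sketched in a few lines of Python. The helper names (`capture_trace`, `build_repair_prompt`), the trace format, and the truncation limit below are illustrative assumptions, not the paper's actual implementation:

```python
import sys

def capture_trace(func, *args, max_events=30):
    """Record (line number, local variables) while func runs on the failing input.

    Minimal sketch using sys.settrace; the paper's real trace format and
    truncation strategy are not specified here.
    """
    events = []

    def tracer(frame, event, arg):
        if event == "line" and len(events) < max_events:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always restore the default tracer
    return events

def build_repair_prompt(source, failing_input, trace):
    """Assemble an APR prompt that injects the dynamic trace alongside the code."""
    trace_lines = "\n".join(f"  line {ln}: locals={lv}" for ln, lv in trace)
    return (
        "The following function is buggy:\n"
        f"{source}\n"
        f"Failing input: {failing_input!r}\n"
        "Execution trace (line, local variables):\n"
        f"{trace_lines}\n"
        "Provide a corrected version of the function."
    )

# Example: an off-by-one bug in a summation routine.
def buggy_sum(n):
    total = 0
    for i in range(n):  # bug: should be range(n + 1)
        total += i
    return total

trace = capture_trace(buggy_sum, 3)
prompt = build_repair_prompt("def buggy_sum(n): ...", 3, trace)
```

Capping the number of recorded events reflects the paper's observation that trace effectiveness diminishes as trace complexity grows, so keeping injected traces short and structured matters.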
Problem

Research questions and friction points this paper is trying to address.

Enhancing program repair by incorporating execution traces
Evaluating effectiveness of traces in LLM-based program repair
Optimizing trace usage to improve LLM repair performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmenting APR prompts with execution traces
Using LLM-optimized prompts for better performance
Comparing trace-based prompting with finetuning small LLMs
Mirazul Haque
Research Scientist, JP Morgan AI Research
Adversarial Machine Learning · Software Engineering
Petr Babkin
J.P. Morgan AI Research
Farima Farmahinifarahani
J. P. Morgan AI Research, Palo Alto
Manuela Veloso
J. P. Morgan AI Research, New York