Patched RTC: evaluating LLMs for diverse software development tasks

📅 2024-07-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack efficient, interpretable methods, free of human evaluation, for assessing performance on software development "outer-loop" tasks such as vulnerability patching, code review, and documentation updates. Method: The authors propose Patched RTC, a round-trip correctness (RTC) framework generalized to arbitrary LLMs and downstream tasks. It models cyclic consistency via bidirectional patch generation and verification, combining consistency-aware prompting with reasoning traces across multi-task patchflows. Contribution/Results: Implemented in the open-source Patchwork framework, Patched RTC offers an alternative to the LLM-as-Judge paradigm, enabling automatic, quantitative evaluation of response consistency and robustness. Across cross-task benchmarks with GPT-3.5 and GPT-4, Patched RTC scores correlate strongly with ground-truth accuracy, effectively distinguishing model capability from task difficulty and guiding prompt optimization and model selection.
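The core round-trip idea can be sketched as follows: a forward pass turns a task description into a patch, a backward pass reconstructs the intent from that patch, and the score is the similarity between the original and reconstructed descriptions. This is a minimal sketch, not the paper's actual implementation; `call_llm`, `token_overlap`, and `patched_rtc_score` are hypothetical names, and the echo stub stands in for a real model so the sketch runs offline.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., GPT-3.5 or GPT-4).

    Trivial echo stub (returns everything after the first colon)
    so the sketch runs without an API key.
    """
    return prompt.split(":", 1)[-1].strip()


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets -- a crude consistency proxy."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def patched_rtc_score(task_description: str, n_samples: int = 3) -> float:
    """Forward: description -> patch. Backward: patch -> description.

    The score is the mean similarity between the original description
    and each round-tripped reconstruction.
    """
    scores = []
    for _ in range(n_samples):
        patch = call_llm(f"Write a patch for: {task_description}")
        reconstructed = call_llm(f"Describe the intent of this patch: {patch}")
        scores.append(token_overlap(task_description, reconstructed))
    return sum(scores) / len(scores)


print(patched_rtc_score("fix null pointer dereference in parser"))  # -> 1.0
```

With a real model and a stronger similarity measure (e.g., an embedding distance), the same loop yields a graded consistency score per task, which is what lets RTC-style evaluation run during inference without ground-truth labels.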

📝 Abstract
This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) applied to diverse software development tasks, particularly focusing on "outer loop" activities such as bug fixing, code review, and documentation updates. Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, offering a self-evaluating framework that measures consistency and robustness of model responses without human intervention. The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation. We implement Patched RTC in an open-source framework called patchwork, allowing for transparent evaluation during inference across various patchflows. Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks reveal that Patched RTC effectively distinguishes model performance and task difficulty. The paper also explores the impact of consistency prompts on improving model accuracy, suggesting that Patched RTC can guide prompt refinement and model selection for complex software development workflows.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for diverse software development tasks
Measuring consistency and robustness of model responses
Providing an alternative to LLM-as-Judge paradigm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Patched RTC evaluates LLMs for diverse software tasks
Self-evaluating framework measures consistency without human intervention
Open-source patchwork enables transparent evaluation during inference
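The consistency-prompting idea above can be illustrated with a simple majority-vote filter over several sampled responses: keep the answer whose normalized form recurs most often. This is an assumption about how such a filter might look, not the paper's exact patchflow implementation; `most_consistent` is a hypothetical helper.

```python
from collections import Counter


def most_consistent(responses: list[str]) -> str:
    """Return the response whose whitespace/case-normalized form occurs
    most often -- a simple majority-vote consistency filter over samples."""
    normalized = [" ".join(r.lower().split()) for r in responses]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the first original response matching the winning form.
    return next(r for r, n in zip(responses, normalized) if n == winner)


samples = ["fix: check for None", "Fix: check for None", "add logging"]
print(most_consistent(samples))  # -> "fix: check for None"
```

Selecting the most self-consistent sample is one plausible way a consistency prompt could raise accuracy: responses the model can reproduce across samples are more likely to be correct than one-off outliers.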