ReCatcher: Towards LLMs Regression Testing for Code Generation

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Frequent updates to large language models (LLMs) for code generation, such as fine-tuning, model merging, and version upgrades, often induce regressions in logical correctness, code quality, and runtime performance, yet existing work lacks a systematic regression evaluation framework. Method: We propose ReCatcher, a regression testing framework for Python code-generation LLMs that systematically compares a current model against a candidate update across three dimensions: logical correctness, static code quality, and execution performance. Contribution/Results: Experiments across fine-tuning, merging, and model-release scenarios with CodeLlama, DeepSeek-Coder, and GPT-4o show that updates can introduce regressions of up to 80% in execution time and up to 50% in handling missing imports, making systematic regression evaluation essential before adopting a new model.

📝 Abstract
Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios (fine-tuning, merging, and model release) using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces regressions of up to 50% in handling missing imports compared to GPT-3.5-turbo, while GPT-4o-mini suffers up to 80% performance degradation in execution time versus GPT-4o. Overall, logical correctness, performance, and error handling (e.g., syntax errors and missing imports) are the most regression-prone areas. Compared with baseline solutions, ReCatcher achieves better and more consistent accuracy across the logical and performance dimensions. ReCatcher highlights the importance of systematic regression evaluation before adopting new models, while assisting researchers and practitioners in making more informed update decisions.
Problem

Research questions and friction points this paper is trying to address.

Model updates (fine-tuning, merging, new releases) can silently degrade generated code
Regressions affect not only correctness but also code quality and performance
No systematic framework exists to compare a current model against a candidate update
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic regression testing framework for code-generation LLMs
Compares two models across logical correctness, static code quality, and execution performance
Evaluates three update scenarios: fine-tuning, merging, and model release
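The comparison idea above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's actual implementation: the `TaskResult` fields, the `regression_report` function, and the 10% slowdown threshold are all assumptions made for the example.

```python
# Hypothetical sketch of a ReCatcher-style regression comparison between a
# current model and a candidate update (names and threshold are assumptions).
from dataclasses import dataclass

@dataclass
class TaskResult:
    passed: bool      # logical correctness: generated code passed its tests
    syntax_ok: bool   # static quality: code parsed without syntax errors
    runtime_s: float  # execution performance: measured runtime in seconds

def regression_report(current, candidate):
    """Return the fraction of tasks on which the candidate regresses
    relative to the current model, per dimension."""
    n = len(current)
    correctness = sum(c.passed and not u.passed
                      for c, u in zip(current, candidate)) / n
    syntax = sum(c.syntax_ok and not u.syntax_ok
                 for c, u in zip(current, candidate)) / n
    # Count a performance regression when the candidate is >10% slower
    # on a task (the threshold here is an assumption for illustration).
    perf = sum(u.runtime_s > 1.10 * c.runtime_s
               for c, u in zip(current, candidate)) / n
    return {"correctness": correctness, "syntax": syntax, "performance": perf}

# Toy usage: on two tasks, the candidate loses correctness on the first
# and runs noticeably slower on the second.
cur = [TaskResult(True, True, 1.0), TaskResult(True, True, 2.0)]
cand = [TaskResult(False, True, 1.0), TaskResult(True, True, 3.0)]
print(regression_report(cur, cand))
# {'correctness': 0.5, 'syntax': 0.0, 'performance': 0.5}
```

In practice the framework would populate such per-task results by executing both models' generations against test suites, running static analysis, and timing execution, then flag dimensions whose regression rates exceed an acceptance threshold.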