Re-Evaluating Code LLM Benchmarks Under Semantic Mutation

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current code-oriented large language model (LLM) benchmarks rely predominantly on single-prompt templates, rendering evaluations highly susceptible to prompt sensitivity and thus unreliable. This work is the first to systematically expose, in the code domain, severe performance volatility (accuracy fluctuations exceeding 20%) and cross-model rank reversals induced by minor semantics-preserving prompt variations. Method: We propose a semantics-faithful, structure-constrained prompt mutation framework that generates 100 semantically equivalent variants per prompt template. We conduct a large-scale evaluation across eight code-generation and code-understanding tasks and ten open-source LLMs, integrating semantics-aware mutation, statistical significance testing, and both relative and absolute stability metrics. Contribution/Results: Our empirical study provides critical evidence of benchmark robustness deficits and delivers a reproducible methodology to quantify and mitigate prompt sensitivity. It establishes foundational guidelines for developing reliable, stable code LLM benchmarks.
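The mutation framework described above rewrites a prompt template while preserving its meaning and overall structure. The paper's actual mutation operators are not reproduced here; the following is a minimal sketch assuming a simple synonym-substitution operator (the `SYNONYMS` table and `mutate_prompt` helper are hypothetical illustrations, not the authors' code):

```python
import random

# Hypothetical synonym table standing in for the paper's mutation operators.
SYNONYMS = {
    "write": ["implement", "create", "produce"],
    "function": ["method", "routine"],
    "returns": ["outputs", "yields"],
}

def mutate_prompt(template: str, rng: random.Random) -> str:
    """Swap words for synonyms, preserving semantics and sentence structure."""
    out = []
    for word in template.split():
        key = word.lower().strip(".,")
        if key in SYNONYMS and rng.random() < 0.5:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

# Generate a pool of semantically similar variants from one seed template.
rng = random.Random(0)
variants = {
    mutate_prompt("Write a function that returns the sum of a list.", rng)
    for _ in range(100)
}
```

Each variant keeps the task's intent intact, so any performance spread across `variants` is attributable to surface phrasing rather than task difficulty.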

📝 Abstract
In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related tasks, such as code understanding and generation. A critical step in constructing code benchmarks is the design of prompts. However, as existing code benchmarks typically rely on a single prompt template per task, they are prone to the issue of prompt sensitivity, where minor prompt variations can result in substantial performance variations, leading to unreliable evaluations of model capabilities. While previous studies have explored prompt sensitivity, their experimental designs and findings are limited to traditional natural language processing (NLP) tasks. In this paper, we present an empirical study investigating prompt sensitivity in code benchmarks. We first propose a general framework that modifies prompt templates in a manner that preserves both their semantics and their structure as much as possible. Based on this framework, we conduct extensive experiments across eight code benchmark tasks on 10 representative open-source LLMs, with each task featuring 100 semantically similar prompt templates. We then analyze the evaluation results using various statistical metrics, focusing on both absolute and relative model performance. Our findings suggest that even slight prompt variations can lead to significant shifts in performance. Additionally, we observe that such variations can introduce inconsistencies in the performance rankings across different models. These insights highlight the need to consider prompt sensitivity when designing future code benchmarks, to ensure more reliable and accurate evaluation of LLM capabilities.
Problem

Research questions and friction points this paper is trying to address.

Investigates prompt sensitivity in code benchmarks for LLMs
Proposes framework to modify prompts while preserving semantics
Analyzes performance variations due to prompt changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework modifies prompts preserving semantics and structure
Extensive experiments with 100 similar prompts per task
Analyzes performance shifts using statistical metrics
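The paper's two stability perspectives can be illustrated with simple statistics: an absolute metric (how far one model's accuracy swings across prompt variants) and a relative metric (whether model rankings stay consistent between variants). The exact metrics used in the paper are not reproduced here; this is a sketch using accuracy range and a tie-free Kendall's tau, with all numbers invented for illustration:

```python
from itertools import combinations

def accuracy_spread(accs):
    """Absolute stability: range of one model's accuracy across prompt variants."""
    return max(accs) - min(accs)

def kendall_tau(rank_a, rank_b):
    """Relative stability: rank agreement between two prompt variants,
    in [-1, 1]. Assumes no tied ranks."""
    pairs = list(combinations(range(len(rank_a)), 2))
    concordant = sum(
        1 for i, j in pairs
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0
    )
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)

# One model's accuracy under four prompt variants (illustrative numbers):
spread = round(accuracy_spread([0.62, 0.48, 0.55, 0.70]), 2)  # 0.22

# Rankings of three models under two prompt variants; a swap between
# the top two models lowers the rank correlation below 1.
tau = kendall_tau([1, 2, 3], [2, 1, 3])
```

A spread above 0.20 matches the >20% fluctuations reported in the summary, and a tau below 1 signals the cross-model rank reversals the paper observes.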