🤖 AI Summary
Existing LLM code evaluation faces two key challenges: the high cost of test suite construction and the escalating risk of data contamination. This paper proposes BIS, a prompt-centric, execution-free evaluation framework grounded in importance sampling theory; it predicts LLM generation performance on novel benchmarks solely from prompt distribution analysis. BIS enables performance prediction without ground-truth labels or code execution, reusing existing annotated data via importance reweighting, thereby mitigating data contamination risk and enabling rapid benchmark validation. Its core design integrates an importance-weighted variational autoencoder, weight truncation, and marginal expectation estimation to ensure prediction stability. Evaluated across 8,000 assessment points, BIS achieves mean absolute errors of only 1.1% for code correctness and 2.15% for pass@1, demonstrating both high accuracy and strong generalization.
📝 Abstract
With the rapid advancement of large language models (LLMs), code generation has become a key benchmark for evaluating LLM capabilities. However, existing benchmarks face two major challenges: (1) the escalating cost of constructing high-quality test suites and reference solutions, and (2) the increasing risk of data contamination, which undermines the reliability of benchmark-based evaluations. In this paper, we propose BIS, a prompt-centric evaluation framework that enables ground-truth-free prediction of LLM performance on code generation tasks. Rather than executing generated code, BIS estimates performance metrics by analyzing the prompt distribution alone. Built on importance sampling theory and implemented using Importance Weighted Autoencoders, our method reweights samples from existing annotated benchmarks to estimate performance on new, unseen benchmarks. To stabilize the estimation, we introduce weight truncation strategies and compute marginal expectations across the fitted distributions. BIS serves as a complementary tool that supports benchmark development and validation under constrained resources, offering quick, actionable feedback for prompt selection and contamination assessment. We conduct extensive experiments involving 8,000 evaluation points across 4 CodeLlama models and 9 diverse benchmarks. Our framework achieves an average absolute prediction error of 1.1% for code correctness scores, with best- and worst-case errors of 0.3% and 1.9%, respectively. It also generalizes well to other metrics, attaining an average absolute error of 2.15% for pass@1. These results demonstrate the reliability and broad applicability of BIS, which can significantly reduce the cost and effort of benchmarking LLMs in code-related tasks.
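The core mechanism described above — reweighting annotated samples by the density ratio between the new and old prompt distributions, with weight truncation for stability — can be illustrated with a minimal, self-contained sketch. Note this is a toy illustration under simplifying assumptions, not the paper's implementation: the function name `estimate_metric` is hypothetical, the prompt distributions here are known 1-D Gaussians rather than densities fitted with Importance Weighted Autoencoders, and truncation is done by clipping weights at a quantile, one of several possible truncation strategies.

```python
import numpy as np

def estimate_metric(scores, log_p_new, log_p_old, clip_quantile=0.95):
    """Estimate a metric's expectation under a new prompt distribution
    via truncated, self-normalized importance sampling.

    scores     : per-prompt metric values from the annotated benchmark
    log_p_new  : log-density of each prompt under the new distribution
    log_p_old  : log-density of each prompt under the old distribution
    """
    w = np.exp(log_p_new - log_p_old)                 # importance weights p_new / p_old
    w = np.minimum(w, np.quantile(w, clip_quantile))  # truncate the heavy tail for stability
    return float(np.sum(w * scores) / np.sum(w))      # self-normalized weighted average

# Toy setup: prompts embedded as 1-D points; old distribution N(0, 1),
# new distribution N(0.5, 1); "correctness" is 1 iff the embedding is > 0.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 5000)
scores = (x > 0).astype(float)
log_p_old = -0.5 * x**2            # log-densities up to a shared constant
log_p_new = -0.5 * (x - 0.5)**2
est = estimate_metric(scores, log_p_new, log_p_old)
```

Without reweighting, the naive average of `scores` would estimate correctness under the old distribution (about 0.5); the truncated importance-weighted estimate instead approximates the rate under the shifted distribution, which is what BIS needs when transferring results from an existing benchmark to a new one.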