When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

📅 2025-02-25
🤖 AI Summary
Existing evaluations of code assistants rely predominantly on static benchmarks, failing to capture the dynamic adaptability required in real-world human-AI collaboration. Method: We propose the first systematic interactive evaluation framework: perturbing static programming benchmarks, injecting diverse simulated user feedback (e.g., functional corrections, aesthetic refinements), and modeling realistic user interaction patterns. We conduct robustness assessments across 10 large language models for code on three benchmark datasets. Results: (1) Interactive feedback substantially alters relative model rankings; (2) feedback type drives divergent editing preferences—functional vs. aesthetic prioritization—and induces distinct quality sensitivity profiles; (3) models exhibit resilience to erroneous feedback, yet interaction fundamentally reshapes their behavioral trajectories. This work bridges the gap between static evaluation and practical collaborative coding, establishing a new paradigm for trustworthy deployment of code LLMs.

📝 Abstract
Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting. Specifically, we perturb static coding benchmarks so that the code model must interact with a simulated user to retrieve key information about the problem. We find that interaction significantly affects model performance, as the relative rankings of 10 models across 3 datasets often vary between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that even when different feedback types are equally effective with respect to performance, they can impact model behaviors such as (1) how models respond to higher- vs. lower-quality feedback and (2) whether models prioritize aesthetic vs. functional edits. Our work aims to "re-evaluate" model coding capabilities through an interactive lens toward bridging the gap between existing evaluations and real-world usage.
Problem

Research questions and friction points this paper is trying to address.

Re-evaluating Code LLMs with interactive feedback
Examining LLM collaboration with simulated users
Bridging gap between static and real-world coding evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive evaluation pipeline
Perturbed static coding benchmarks
Simulated user feedback integration
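The pipeline above can be illustrated with a minimal sketch: a static benchmark problem is perturbed so a key detail is withheld, a simulated user returns that detail as feedback when the model's solution fails, and the model gets several turns to converge. All names (`Problem`, `simulated_user`, `interactive_eval`, the stub model) are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    prompt: str                      # perturbed prompt with a key detail removed
    hidden_spec: str                 # the withheld detail, held by the simulated user
    tests: Callable[[str], bool]     # returns True if the solution passes

def simulated_user(solution: str, problem: Problem, feedback_type: str) -> str:
    """Return feedback; functional feedback reveals the withheld information."""
    if feedback_type == "functional":
        return f"The output is wrong: remember that {problem.hidden_spec}."
    return "Please also tidy up variable names."  # aesthetic feedback

def interactive_eval(model: Callable[[str], str], problem: Problem,
                     max_turns: int = 3) -> dict:
    """Run a multi-turn loop: generate, test, feed back on failure."""
    history = [problem.prompt]
    for turn in range(1, max_turns + 1):
        solution = model("\n".join(history))
        if problem.tests(solution):
            return {"passed": True, "turns": turn}
        history.append(simulated_user(solution, problem, "functional"))
    return {"passed": False, "turns": max_turns}

# Stub "model": only produces correct code once the feedback reveals the detail.
def stub_model(context: str) -> str:
    if "% 2" in context:
        return "def is_even(n): return n % 2 == 0"
    return "def is_even(n): return True"

prob = Problem(prompt="Write is_even(n).",
               hidden_spec="evenness means n % 2 == 0",
               tests=lambda code: "% 2 == 0" in code)
result = interactive_eval(stub_model, prob)
```

With this toy setup the stub fails on the perturbed prompt, receives functional feedback on turn 1, and passes on turn 2, mirroring how the framework measures performance as a function of interaction rather than a single static attempt.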