CIFE: Code Instruction-Following Evaluation

📅 2025-12-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing code generation benchmarks emphasize functional correctness while neglecting whether models reliably adhere to developer-imposed constraints such as robustness, formatting, and security. Method: We introduce C2A-Bench, the first Python code evaluation benchmark dedicated to constraint adherence, comprising 1,000 tasks paired with atomic, objective constraints that span 13 categories and are co-created by humans and AI. We propose a novel four-stage human-AI collaborative constraint construction pipeline, design multi-granularity annotation and complementary adherence evaluation protocols, and define the C2A Score, a holistic metric that jointly measures functional correctness and compliance with explicit constraints across multiple dimensions. Results: Experiments reveal a significant bottleneck in strict constraint adherence among mainstream LLMs (39–66%), substantially lower than the partial adherence rates of strong models (>90%), demonstrating that intent alignment remains a critical challenge for trustworthy code generation.

📝 Abstract

Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment: developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction: while strong models achieve over 90% partial adherence, strict adherence remains between 39% and 66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
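The gap between partial and strict satisfaction can be illustrated with a minimal sketch. The abstract does not give the metric definitions, so the code assumes that partial adherence is the mean fraction of constraints satisfied per task and that strict adherence is the fraction of tasks on which every constraint is satisfied. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List


def partial_adherence(results: List[List[bool]]) -> float:
    """Mean fraction of constraints satisfied per task (assumed definition)."""
    per_task = [sum(r) / len(r) for r in results if r]
    return sum(per_task) / len(per_task)


def strict_adherence(results: List[List[bool]]) -> float:
    """Fraction of tasks where every constraint is satisfied (assumed definition)."""
    nonempty = [r for r in results if r]
    return sum(all(r) for r in nonempty) / len(nonempty)


# Toy data: one inner list per task, one boolean per constraint.
results = [
    [True, True, True, True],
    [True, True, True, False],
    [True, True, False, True],
    [True, True, True, True],
]

partial = partial_adherence(results)  # 0.875
strict = strict_adherence(results)    # 0.5

# Back-of-the-envelope illustration, assuming independent constraints:
# a 93% per-constraint success rate over 7 constraints gives roughly
# a 60% chance that all 7 hold at once.
all_seven = 0.93 ** 7
```

Under these assumptions a single missed constraint barely moves the partial rate but fails the whole task under the strict rate, which is one way a model can score above 90% on one metric and far lower on the other.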

Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs on code generation beyond correctness

Measures adherence to developer constraints like robustness and security

Highlights gap between partial and strict constraint satisfaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark with 1,000 Python tasks and constraints

Four-stage human-LLM pipeline for constraint curation

C2A Score measuring correctness and constraint compliance
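To make the idea of an atomic, objective constraint concrete, here is a minimal sketch of two checks that can be decided mechanically from source code. The source does not state how adherence is judged in the benchmark, so the two constraints ("every function has a docstring" and "no calls to eval or exec") and the helper names are hypothetical illustrations, not the benchmark's own checks.

```python
import ast


def has_docstrings(source: str) -> bool:
    """Return True if every function defined in `source` has a docstring."""
    tree = ast.parse(source)
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def avoids_eval(source: str) -> bool:
    """Return True if `source` never calls eval() or exec() by name."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in {"eval", "exec"}
        ):
            return False
    return True


# Hypothetical model output to be checked against both constraints.
sample = '''
def parse_age(text):
    """Convert text to a non-negative int."""
    value = int(text)
    if value < 0:
        raise ValueError("age must be non-negative")
    return value
'''

checks = {
    "docstring": has_docstrings(sample),
    "no_eval": avoids_eval(sample),
}
```

Each check yields a single pass or fail verdict, so per-task results like these could feed directly into partial and strict adherence rates. Constraints that cannot be decided syntactically (for example, robustness to malformed input) would need tests or a separate judge.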

🔎 Similar Papers

CoIR: A Comprehensive Benchmark for Code Information Retrieval Models