OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

📅 2024-02-22
🏛️ Annual Meeting of the Association for Computational Linguistics
📈 Citations: 67
Influential: 6
🤖 AI Summary
Current open-source large language models lack the closed-loop execution capability of the GPT-4 Code Interpreter, which limits both their code-generation accuracy and their ability to self-improve. This paper introduces OpenCodeInterpreter, an open-source framework for generating, executing, and iteratively refining code. It integrates sandboxed code execution, runtime feedback, and GPT-4-synthesized human feedback into a multi-turn "generate–execute–feedback–revise" loop. The authors also release Code-Feedback, a high-quality dataset of 68K multi-turn interactions used to train the models. Evaluated on HumanEval and MBPP, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 (76.4 on the EvalPlus-enhanced versions); incorporating synthesized human feedback further improves performance to 91.6 (84.6), approaching GPT-4's level.
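The multi-turn loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` stands in for a hypothetical model-call interface, and the "sandbox" here is simply a subprocess with a timeout, whereas the actual system uses a proper sandboxed executor.

```python
import subprocess
import sys
import tempfile

def execute(code: str, timeout: float = 5.0):
    """Run candidate code in a subprocess; return (success, runtime feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "execution timed out"

def refine_loop(generate, task: str, max_turns: int = 3):
    """generate(task, feedback) -> code string (placeholder for the model).

    Each turn: generate code, execute it, and feed any error trace back
    into the next generation call until the code runs or turns run out.
    """
    code, feedback = "", None
    for _ in range(max_turns):
        code = generate(task, feedback)
        ok, feedback = execute(code)
        if ok:
            return code
    return code
```

In the paper, the feedback channel additionally carries execution outputs and (synthesized) human critiques rather than stderr alone, but the control flow follows this pattern.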

📝 Abstract
The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average of HumanEval and MBPP (and their plus versions), closely rivaling GPT-4's 84.2 (76.2), and further improves to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like the GPT-4 Code Interpreter.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Code Generation
Performance Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenCodeInterpreter
Code-Feedback dataset
iterative code optimization