CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing programming assistant benchmarks predominantly evaluate code generation in isolation, neglecting multi-turn, context-aware conversational programming assistance within realistic project environments. While benchmarks such as InfiBench and StackEval attempt to address this gap, they remain limited to single-turn interactions, require extensive manual curation, and fail to reflect authentic codebase contexts. Method: We propose the first benchmark framework for multi-turn conversational programming assistance, automatically generating scalable, realistic test cases from GitHub issues and evaluating assistants in containerized, project-specific execution environments using Docker. Our approach integrates automated GitHub issue crawling and filtering, parameterized dataset configuration, and simulated user–assistant interaction testing. Contribution/Results: The benchmark comprises 3,286 real-world problems across 231 repositories. Evaluation of state-of-the-art LLM-based assistants reveals a stark performance drop: they solve only up to 16.49% of problems in authentic codebases, compared to 70–83% on Stack Overflow questions, highlighting the critical need for environment-aware, multi-turn assessment.
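To make the "parameterized dataset configuration" step concrete, here is a minimal sketch of how repository filtering by configurable criteria (creation date, star count, programming language) might look. The `Repo` structure, parameter names, and sample data are hypothetical illustrations, not taken from the paper or its released code.

```python
from dataclasses import dataclass

@dataclass
class Repo:
    # Hypothetical metadata fields a GitHub crawler might collect.
    name: str
    stars: int
    language: str
    created_year: int

def filter_repos(repos, min_stars=100, languages=None, min_year=2020):
    """Keep only repositories matching the configured criteria."""
    selected = []
    for repo in repos:
        if repo.stars < min_stars:
            continue
        if languages is not None and repo.language not in languages:
            continue
        if repo.created_year < min_year:
            continue
        selected.append(repo)
    return selected

# Illustrative inputs only.
repos = [
    Repo("fast-json", 2500, "Rust", 2021),
    Repo("tiny-http", 40, "C", 2019),
    Repo("webapp", 900, "TypeScript", 2022),
]
kept = filter_repos(repos, min_stars=100, languages={"Rust", "TypeScript"})
print([r.name for r in kept])  # ['fast-json', 'webapp']
```

In a real pipeline these predicates would be applied to metadata returned by the GitHub API before the associated issues are crawled and containerized.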

📝 Abstract
Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings, addressing real-world questions about actual codebases. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions with success rates of 70-83%, they resolve only up to 16.49% of CAB's recent issues. This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions.
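The multi-turn evaluation described above, where a simulated user converses with the assistant until the issue is resolved or a turn budget runs out, can be sketched as a simple dialogue loop. Everything here is a toy illustration: the function names, the stub assistant, and the keyword-matching user are assumptions for demonstration, not the paper's actual harness, which runs inside a Docker container with full codebase access.

```python
def run_dialogue(assistant, simulated_user, issue, max_turns=5):
    """Alternate assistant replies and simulated-user feedback
    until the user is satisfied or the turn budget is exhausted."""
    transcript = []
    message = issue["question"]
    for _ in range(max_turns):
        reply = assistant(message, transcript)
        transcript.append(("assistant", reply))
        message, satisfied = simulated_user(reply, issue)
        transcript.append(("user", message))
        if satisfied:
            return transcript, True
    return transcript, False

# Toy assistant: asks for clarification first, then answers.
def assistant(message, transcript):
    if not transcript:
        return "Could you share the error message?"
    return "Try setting `timeout=30` in the client config."

# Toy simulated user: satisfied once the expected fix appears.
def simulated_user(reply, issue):
    satisfied = issue["expected"] in reply
    follow_up = "Thanks, that works!" if satisfied else "It still fails with a timeout."
    return follow_up, satisfied

issue = {"question": "Requests to my server time out.", "expected": "timeout=30"}
transcript, solved = run_dialogue(assistant, simulated_user, issue)
print(solved)  # True, resolved on the second assistant turn
```

This loop structure is what makes the benchmark multi-turn: the assistant may need to ask clarifying questions and incorporate feedback, rather than emit a single standalone answer.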
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-turn programming assistance in realistic project environments
Automating dataset generation from GitHub issues for scalable benchmarks
Assessing LLM performance on complex project-specific coding questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated dataset generation from GitHub issues
Containerized codebases for realistic evaluation
Multi-turn assistance in project-specific contexts