🤖 AI Summary
Existing large language model (LLM)-based programming assistants lack systematic evaluation of their ability to adopt and support empirically validated software engineering (SE) practices. Method: This study presents a preliminary assessment of five mainstream LLM programming assistants against 17 empirically validated SE practice claims, using a prompt-engineering-based claim verification framework that integrates multi-model comparative analysis and mapping to an SE knowledge base. Contribution/Results: All models exhibit poor and unstable support for most claims: only 12.3% of responses contain verifiable evidence, and responses frequently show ambiguous evidential grounding, low credibility, and misalignment with empirical practice. To address this gap, the study introduces "evidence alignment", a novel evaluation dimension that quantifies how closely model outputs conform to empirical SE findings. This metric provides both a methodological foundation and an empirical basis for assessing LLM trustworthiness and for guiding the evidence-driven design of programming assistants.
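As a concrete illustration only, a claim-verification loop of the kind described above might look like the sketch below. Everything in it is a hypothetical stand-in, not the paper's implementation: the `Judgment` record, the `query_model` and `judge_response` placeholders, the stance labels, and the reading of "evidence alignment" as the fraction of responses backed by verifiable evidence are all assumptions.

```python
# Hypothetical sketch of a prompt-based claim-verification pipeline.
# The claims, labels, and helper functions are illustrative; the paper's
# actual prompts, assistants, and scoring rubric may differ.

from dataclasses import dataclass


@dataclass
class Judgment:
    assistant: str      # which LLM assistant produced the response
    claim: str          # the empirical SE claim under test
    stance: str         # e.g. "agree" / "disagree" / "ambiguous"
    has_evidence: bool  # did the response cite verifiable evidence?


def query_model(assistant: str, prompt: str) -> str:
    """Placeholder for an API call to one assistant under test."""
    raise NotImplementedError


def judge_response(response: str) -> tuple[str, bool]:
    """Placeholder for manual or rubric-based labeling of a response."""
    raise NotImplementedError


def evaluate(assistants: list[str], claims: list[str]) -> list[Judgment]:
    """Prompt every assistant with every claim and label each response."""
    judgments = []
    for assistant in assistants:
        for claim in claims:
            prompt = (
                "Do you agree with this finding from empirical SE research? "
                f"Cite supporting evidence. Claim: {claim}"
            )
            response = query_model(assistant, prompt)
            stance, has_evidence = judge_response(response)
            judgments.append(Judgment(assistant, claim, stance, has_evidence))
    return judgments


def evidence_alignment(judgments: list[Judgment]) -> float:
    """One plausible reading of 'evidence alignment': the fraction of
    responses grounded in verifiable evidence. The paper's exact
    definition may also weight stance or source credibility."""
    if not judgments:
        return 0.0
    return sum(j.has_evidence for j in judgments) / len(judgments)
```

Under this reading, the reported 12.3% figure would correspond to an `evidence_alignment` score of roughly 0.123 computed over all responses across the five assistants and 17 claims (presumably over repeated runs).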
📝 Abstract
Recent innovations in artificial intelligence (AI), primarily powered by large language models (LLMs), have transformed how programmers develop and maintain software -- leading to new frontiers in software engineering (SE). The advanced capabilities of LLM-based programming assistants to support software development tasks have led to a rise in the adoption of LLMs in SE. However, little is known about whether AI programming assistants support and adopt the evidence-based practices, tools, and processes verified by research findings. To this end, our work conducts a preliminary evaluation exploring the beliefs and behaviors of LLMs used to support software development tasks. We investigate 17 evidence-based claims posited by empirical SE research across five LLM-based programming assistants. Our findings show that LLM-based programming assistants hold ambiguous beliefs regarding research claims, lack credible evidence to support their responses, and are unable to adopt practices demonstrated by empirical SE research when supporting development tasks. Based on our results, we provide implications for practitioners adopting LLM-based programming assistants in development contexts and shed light on future research directions to enhance the reliability and trustworthiness of LLMs -- aiming to increase awareness and adoption of evidence-based SE research findings in practice.