🤖 AI Summary
To address the misalignment with user needs, the poor scalability, and the weak interpretability of current approaches to evaluating LLM-driven software engineering (SE) dialogue assistants, this paper proposes the first interdisciplinary evaluation framework integrating human-computer interaction (HCI) principles with AI-based automated assessment. Methodologically, it unifies HCI-driven requirement modeling with LLM behavioral modeling, designs a developer-centered suite of automated metrics, and validates reliability along multiple dimensions. Key contributions include: (1) articulating six core human-centered requirements for automated SE assistant evaluation, along with the challenges they raise; (2) establishing a scalable, reproducible, and user-aligned evaluation paradigm; and (3) providing both theoretical foundations and an open, implementable toolchain. The framework aims to make SE assistant evaluation more practical, transparent, and compatible with human factors.
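To make the idea of a "developer-centered suite of automated metrics" concrete, here is a minimal, hypothetical sketch of what scoring a dialogue turn against human-centered requirements could look like. Everything here is an illustrative assumption, not the paper's actual design: the `Requirement` class, the example criteria, and `score_turn` are made-up names, and the paper's own six requirements are defined in the full text, not reproduced here.

```python
# Hypothetical sketch: scoring one assistant turn against human-centered
# requirements via a pluggable judge (e.g., an LLM judge or a human rater).
# All names and criteria below are illustrative assumptions, not the
# paper's API or its actual six requirements.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Requirement:
    """One human-centered evaluation criterion with a judge rubric."""
    name: str
    rubric: str  # instructions given to the judge


# Example criteria, loosely inspired by the paper's framing.
REQUIREMENTS = [
    Requirement("grounding", "Is the answer grounded in the developer's context?"),
    Requirement("actionability", "Can the developer act on the response directly?"),
    Requirement("transparency", "Does the assistant explain its reasoning and limits?"),
]


def score_turn(
    user_msg: str,
    assistant_msg: str,
    judge: Callable[[str], float],  # returns a score in [0, 1] for a rubric prompt
) -> dict[str, float]:
    """Score a single user/assistant exchange against every requirement."""
    scores: dict[str, float] = {}
    for req in REQUIREMENTS:
        prompt = (
            f"Rubric: {req.rubric}\n"
            f"User: {user_msg}\n"
            f"Assistant: {assistant_msg}\n"
            "Return a score between 0 and 1."
        )
        scores[req.name] = judge(prompt)
    return scores


if __name__ == "__main__":
    # Stub judge so the sketch runs without any LLM backend.
    demo = score_turn(
        "Why does my test fail on CI but pass locally?",
        "Likely an environment difference; compare Python versions and pinned deps.",
        judge=lambda prompt: 0.5,
    )
    print(demo)  # {'grounding': 0.5, 'actionability': 0.5, 'transparency': 0.5}
```

In a real instantiation, the stub `judge` would be replaced by an automated assessor (such as an LLM prompted with each rubric), which is what would make evaluation against these requirements scalable.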
📝 Abstract
As Large Language Models (LLMs) are increasingly adopted in software engineering, recently in the form of conversational assistants, ensuring these technologies align with developers' needs is essential. The limitations of traditional human-centered methods for evaluating LLM-based tools at scale raise the need for automatic evaluation. In this paper, we advocate combining insights from human-computer interaction (HCI) and artificial intelligence (AI) research to enable human-centered automatic evaluation of LLM-based conversational SE assistants. We identify requirements for such evaluation and challenges down the road, working towards a framework that ensures these assistants are designed and deployed in line with user needs.