🤖 AI Summary
This study addresses the lack of validated workflows and IDE tooling for developers who review multi-file code changes generated by large language models (LLMs). In collaboration with JetBrains, the authors ran a participatory design study structured by the double-diamond process and propose a three-level review workflow (overview, file-level analysis, and code snippet review) centered on trust calibration, supported by seven design constructs. A high-fidelity, semi-interactive prototype was evaluated in a follow-up survey of 43 practitioners, in which all three workflow levels scored above the neutral midpoint (means 3.50 to 3.91 on a five-point scale). Notably, 63% of respondents expected reduced overall review effort and 52% expected reduced trust-assessment effort relative to their current tools, suggesting a promising direction for building AI-ready code review tools.
📝 Abstract
Background: Developers increasingly review multi-file code changes generated by LLM-based agents, yet no validated end-to-end workflow or IDE tooling design exists for this scenario.
Aims: We investigate (RQ1) the challenges developers face when reviewing LLM-generated multi-file changes and (RQ2) how developers envision effective workflows for this task.
Method: In collaboration with JetBrains, we conducted a participatory design study structured using the double-diamond design process with Discover, Define, Develop, and Deliver phases. Industry practitioners participated in the Discover phase (N=17); seven of these returned for the Develop phase. The Define phase was an author-led synthesis. The Deliver phase produced a conceptual design and a high-fidelity semi-interactive prototype evaluated through a follow-up survey with N=43 practitioners.
Results: Participants identified trust calibration as the central challenge. The study yielded a three-level review workflow (overview, file analysis, code snippet review) supported by seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage). In the validation survey, all three workflow levels scored above the neutral midpoint (means 3.50 to 3.91 on a five-point scale). Relative to their current tools, 63% of respondents expected reduced overall review effort and 52% expected reduced trust-assessment effort. These findings suggest that the design constructs are a promising direction for future tool development.
Conclusions: Reviewing LLM-generated multi-file changes is a trust-calibration problem rather than a diffing problem. The three-level workflow and the seven constructs we report give tool designers a conceptual framework for building AI-ready code review tools that surface risk and confidence signals at the granularity at which developers allocate attention.