Trust-Calibrated Code Review: A Participatory Design Study of Review Workflows for LLM-Generated Multi-File Changes

📅 2026-06-01

🤖 AI Summary

This study addresses the lack of validated workflows and IDE tooling for end-to-end trust calibration when developers review multi-file code changes generated by large language models (LLMs). In collaboration with JetBrains, the authors ran a participatory design study structured by the double-diamond design process and derived a three-level review workflow (overview, file-level analysis, and code snippet review) centered on trust calibration, supported by seven design constructs. A high-fidelity, semi-interactive prototype was evaluated in a follow-up survey of 43 practitioners, in which all three workflow levels scored above the neutral midpoint (means 3.50 to 3.91 on a five-point scale). Relative to their current tools, 63% of respondents expected reduced overall review effort and 52% expected reduced trust-assessment effort, which the authors read as a positive direction for building AI-ready code review tools.

📝 Abstract

Background: Developers increasingly review multi-file code changes generated by LLM-based agents, yet no validated end-to-end workflow or IDE tooling design exists for this scenario. Aims: We investigate (RQ1) the challenges developers face when reviewing LLM-generated multi-file changes and (RQ2) how developers envision effective workflows for this task. Method: In collaboration with JetBrains, we conducted a participatory design study structured using the double-diamond design process with Discover, Define, Develop, and Deliver phases. Industry practitioners participated in the Discover phase (N=17); seven of these returned for the Develop phase. The Define phase was an author-led synthesis. The Deliver phase produced a conceptual design and a high-fidelity semi-interactive prototype evaluated through a follow-up survey with N=43 practitioners. Results: Participants identified trust calibration as the central challenge. The study yielded a three-level review workflow (overview, file analysis, code snippet review) supported by seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage). In the validation survey, all three workflow levels scored above the neutral midpoint (means 3.50 to 3.91 on a five-point scale). Relative to their current tools, 63% of respondents expected reduced overall review effort and 52% expected reduced trust-assessment effort. These findings suggest that the design constructs are a promising direction for future tool development. Conclusions: Reviewing LLM-generated multi-file changes is a trust-calibration problem rather than a diffing problem. The three-level workflow and the seven constructs we report give tool designers a conceptual framework for building AI-ready code review tools that surface risk and confidence signals at the granularity at which developers allocate attention.
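
To make the three-level workflow more concrete, the following is a minimal Python sketch of how the chunk, risk-per-line, and risk-per-file constructs might fit together. It is not the authors' prototype: the class names, the function names, the 0-to-1 risk scale, the `threshold` parameter, and the "take the maximum" aggregation rule are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A contiguous group of changed lines in one file (hypothetical model)."""
    start_line: int
    line_risks: list[float]  # assumed: one risk score in [0, 1] per changed line

    def risk(self) -> float:
        # Assumption: a chunk is as risky as its riskiest line.
        return max(self.line_risks, default=0.0)


@dataclass
class FileChange:
    """All chunks changed in a single file (hypothetical model)."""
    path: str
    chunks: list[Chunk] = field(default_factory=list)

    def risk(self) -> float:
        # Assumption: risk-per-file is the maximum risk over the file's chunks.
        return max((c.risk() for c in self.chunks), default=0.0)


def overview(files: list[FileChange]) -> list[FileChange]:
    """Level 1 (overview): files ordered from highest to lowest risk."""
    return sorted(files, key=lambda f: f.risk(), reverse=True)


def file_analysis(change: FileChange) -> list[Chunk]:
    """Level 2 (file analysis): chunks of one file, riskiest first."""
    return sorted(change.chunks, key=lambda c: c.risk(), reverse=True)


def snippet_review(chunk: Chunk, threshold: float = 0.5) -> list[int]:
    """Level 3 (code snippet review): line numbers at or above the threshold."""
    return [
        chunk.start_line + offset
        for offset, risk in enumerate(chunk.line_risks)
        if risk >= threshold
    ]
```

Under these assumptions, a reviewer "zooms in" by calling `overview` on the whole change, `file_analysis` on the file they pick, and `snippet_review` on a chunk of that file. A real tool would likely need a richer aggregation rule than the maximum, which is used here only because it keeps a single high-risk line visible at every level.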

Problem

Research questions and friction points this paper is trying to address.

trust-calibration

code review

LLM-generated code

multi-file changes

developer workflow

Innovation

Methods, ideas, or system contributions that make the work stand out.

trust-calibration

multi-file code review

LLM-generated code