More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses low-quality bug reports in crowdsourced testing, which impose a substantial review burden on developers. Building on a multi-agent assessment framework from the authors' prior work, which follows the LLM-as-a-Judge paradigm and assesses reports along three dimensions (textuality, adequacy, and competitiveness), the paper examines whether actionable feedback derived from those assessments improves testers' work when embedded in the workflow. In a controlled four-stage human-subject study with 20 testers across three real-world applications, agent-generated feedback supported immediate improvements in revised reports, better first submissions on a new task after prior feedback exposure, and partial but meaningful transfer of reporting practices to a later application. A post-task questionnaire completed by 17 participants suggests that the feedback was generally understandable and acted upon, while revealing remaining friction in specificity and execution. Overall, the results indicate that assessment agents can serve as workflow-integrated feedback providers as well as post-hoc judges.

📝 Abstract

Agentic AI is increasingly being integrated into software engineering workflows. In crowdsourced testing, however, the large volume and uneven quality of submitted reports still create a substantial review burden for developers. In prior work, we developed and validated a multi-agent assessment backbone based on the LLM-as-a-Judge paradigm. That backbone assesses reports along three dimensions (textuality, adequacy, and competitiveness) and was shown to align well with human consensus while substantially reducing assessment effort. Yet reliable automated judging does not by itself show whether agent outputs can improve human work when embedded into the workflow. This paper studies that open question in the context of crowdsourced testing. We investigate whether assessment-derived, actionable feedback can improve how testers revise reports, perform on later tasks, and transfer reporting practices across applications. To do so, we conducted a controlled four-stage human-subject study with 20 testers across three real-world applications. The results show that agent-generated feedback supports immediate improvements in revised reports, better first submissions on a new task after prior feedback exposure, and evidence of partial but meaningful transfer to a later application. A post-task questionnaire completed by 17 participants complements these artifact-based findings by suggesting that the feedback was generally understandable, acted upon in revision, and carried into later tasks, while also revealing remaining friction in specificity and execution. Overall, the study provides empirical evidence that, in the studied crowdsourced testing setting, assessment agents can serve not only as post-hoc judges but also as workflow-integrated feedback providers that support upstream report-quality improvement.

Problem

Research questions and friction points this paper is trying to address.

crowdsourced testing

agent-human interaction

actionable feedback

report quality improvement

workflow integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

agent-human collaboration

actionable feedback

crowdsourced testing