🤖 AI Summary
This study investigates how feedback loops differ between internal and external human annotator groups on the complex task of creating multi-turn retrieval-augmented generation (RAG) conversations for evaluating large language models (LLMs). Using a longitudinal design that combines iterative annotation rounds with annotator experience surveys, the authors analyze trade-offs among conversation quality, quantity, and diversity across the two groups. The results show that the tighter feedback loop available to internal annotators yields higher-quality conversations but reduces quantity and diversity, while the looser loop of external annotators produces greater diversity at some cost to quality. Based on these findings, the study offers practical guidance on how to best allocate internal and external annotator populations when designing complex annotation workflows for RAG data construction.
📝 Abstract
Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important capability of chat-based assistants powered by Large Language Models (LLMs), as it helps ensure they remain faithful and do not produce misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loops of internal and external human annotator groups on the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and report results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of their different feedback loops: a tighter loop creates higher-quality conversations at the cost of quantity and diversity. Finally, we present guidance on how to best utilize the two population groups when performing annotation tasks, particularly complex ones.