🤖 AI Summary
This study investigates the mechanisms and efficacy of collaborative large language model (LLM) agents in supporting scientific research, with a focus on the automated generation and evaluation of high-quality multiple-choice questions (MCQs). We design a human-coordinated multi-agent workflow that integrates PDF parsing, textbook structuring, content alignment, controllable MCQ generation, and automatic assessment against 24 evaluation criteria. As the first empirical study to examine LLM agent collaboration in research tasks, it reveals a shift in the human role from content creation toward norm specification and process governance. Applying this framework to SAT Mathematics, we collected 1,071 MCQs and generated new items of generally high quality; however, the generated items still lag significantly behind expert-authored ones in skill depth, cognitive engagement, difficulty calibration, and metadata alignment, despite excelling in surface-level qualities such as grammatical fluency.
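The five-stage workflow named above can be pictured as a staged pipeline. The following is a minimal Python sketch of such an orchestration under assumptions: the `call_agent` helper, the agent role names, and the payload fields are illustrative placeholders, not the authors' implementation; only the stages themselves mirror the summary.

```python
# Minimal sketch of a human-coordinated multi-agent MCQ workflow.
# `call_agent`, the role names, and the payload fields are hypothetical;
# only the five stages correspond to the workflow described above.

def call_agent(role: str, payload: dict) -> dict:
    """Placeholder for a single LLM-agent invocation."""
    raise NotImplementedError(f"wire an LLM backend to the '{role}' agent")

def run_pipeline(exam_pdf: str, textbook_src: str) -> dict:
    # 1) Parse exam PDFs into structured baseline MCQs.
    mcqs = call_agent("pdf_parser", {"pdf": exam_pdf})
    # 2) Convert an open textbook into a structured representation.
    textbook = call_agent("textbook_structurer", {"source": textbook_src})
    # 3) Align each MCQ with the relevant textbook content.
    aligned = call_agent("content_aligner", {"mcqs": mcqs, "textbook": textbook})
    # 4) Generate new MCQs under target difficulty and cognitive levels.
    generated = call_agent("mcq_generator",
                           {"aligned": aligned,
                            "difficulty": "medium",      # controllable knobs
                            "cognitive_level": "apply"})
    # 5) Score originals and generated items against the 24-criterion rubric.
    return call_agent("evaluator",
                      {"original": mcqs, "generated": generated,
                       "criteria_count": 24})
```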
📝 Abstract
Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple-choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24-criterion quality framework. Across all evaluations, average MCQ quality was high. However, criterion-level analysis and equivalence testing show that generated MCQs are not fully comparable to expert-vetted baseline questions. Strict similarity (24/24 criteria equivalent) was never achieved. Persistent gaps were concentrated in skill depth, cognitive engagement, difficulty calibration, and metadata alignment, while surface-level qualities, such as "grammar fluency", "clarity options", and "no duplicates", were consistently strong. Beyond MCQ outcomes, the study documents a labor shift: the researcher's work moved from "authoring items" toward specification, orchestration, verification, and governance. Formalizing constraints, designing rubrics, building validation loops, recovering from tool failures, and auditing provenance constituted the primary activities. We discuss implications for the future of scientific work, including the emerging "AI research operations" skills required for AI-empowered research pipelines.
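The abstract's criterion-level equivalence testing can be illustrated with a per-criterion two one-sided tests (TOST) procedure. The paper's actual statistical method, score scale, equivalence margin, and criterion names are not specified in the abstract, so everything in the sketch below is an assumption; TOST is simply one standard way to test equivalence of means.

```python
# Hypothetical per-criterion equivalence check via TOST (two one-sided tests).
# The margin, alpha, score scale, and criterion names are assumptions, not the
# paper's procedure; "strict similarity" = all 24 criteria found equivalent.
import numpy as np
from scipy import stats

def tost_equivalent(gen, base, margin=0.5, alpha=0.05) -> bool:
    """Equivalent if the mean difference lies within +/- margin at level alpha."""
    gen, base = np.asarray(gen, dtype=float), np.asarray(base, dtype=float)
    # H1a: mean(gen) - mean(base) > -margin
    _, p_low = stats.ttest_ind(gen + margin, base, alternative="greater")
    # H1b: mean(gen) - mean(base) < +margin
    _, p_high = stats.ttest_ind(gen - margin, base, alternative="less")
    return max(p_low, p_high) < alpha

def strict_similarity(gen_scores: dict, base_scores: dict) -> tuple[bool, dict]:
    """Strict similarity requires every criterion to pass the equivalence test."""
    verdicts = {c: tost_equivalent(gen_scores[c], base_scores[c])
                for c in gen_scores}
    return all(verdicts.values()), verdicts

if __name__ == "__main__":
    # Synthetic demo on a hypothetical 1-5 scale, 4 of the 24 criteria shown.
    rng = np.random.default_rng(0)
    criteria = ["grammar_fluency", "clarity_options", "no_duplicates", "skill_depth"]
    gen = {c: rng.normal(4.5, 0.5, 100) for c in criteria}
    base = {c: rng.normal(4.6, 0.5, 100) for c in criteria}
    ok, verdicts = strict_similarity(gen, base)
    print("strict similarity:", ok, verdicts)
```

Under this reading, "strict similarity (24/24 criteria equivalent) was never achieved" corresponds to `strict_similarity` returning `False` in every comparison, with failures concentrated in the skill-depth, cognitive-engagement, difficulty-calibration, and metadata-alignment criteria.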