Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the mechanisms and efficacy of collaborative large language model (LLM) agents in supporting scientific research, with a focus on the automated generation and evaluation of high-quality multiple-choice questions (MCQs). We design a human-coordinated multi-agent workflow that integrates PDF parsing, textbook structuring, content alignment, controllable MCQ generation, and automatic assessment based on 24 evaluation criteria. As the first empirical study to examine LLM agent collaboration in research tasks, it reveals a shift in the human role from content creation toward norm specification and process governance. Applying this framework to SAT Mathematics, we generated 1,071 MCQs of generally high quality; however, they still lag significantly behind expert-authored items in skill depth, cognitive engagement, difficulty calibration, and metadata alignment, despite excelling in surface-level qualities such as grammatical fluency.
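The workflow described above can be pictured as a staged pipeline. The sketch below is a minimal, hypothetical illustration only: every function name, data shape, and stage boundary is an assumption, since the paper's agents are LLM-driven and human-governed rather than deterministic stubs like these.

```python
# Hypothetical sketch of the five-stage, human-coordinated multi-agent
# workflow: PDF parsing -> textbook structuring -> content alignment ->
# controllable MCQ generation -> 24-criterion evaluation.
# All names and return shapes are illustrative assumptions.

def parse_pdf(pdf_path):
    """Stage 1 (stub): extract raw MCQs from a PDF exam file."""
    return [{"stem": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "4"}]

def structure_textbook(textbook_text):
    """Stage 2 (stub): convert an open textbook into structured sections."""
    return [{"section": "Arithmetic", "content": textbook_text}]

def align(mcq, sections):
    """Stage 3 (stub): attach the most relevant textbook section to an MCQ."""
    mcq["aligned_section"] = sections[0]["section"]
    return mcq

def generate_mcq(section, difficulty, cognitive_level):
    """Stage 4 (stub): generate a new MCQ at a specified difficulty and
    cognitive level, grounded in a textbook section."""
    return {"stem": f"({difficulty}/{cognitive_level}) item on {section['section']}",
            "options": ["A", "B", "C", "D"], "answer": "A"}

# Placeholder labels standing in for the paper's 24 evaluation criteria.
CRITERIA = [f"criterion_{i}" for i in range(1, 25)]

def evaluate(mcq, criteria):
    """Stage 5 (stub): score an MCQ on each quality criterion."""
    return {c: 1.0 for c in criteria}

def run_pipeline(pdf_path, textbook_text):
    """Run all five stages and return criterion scores for both the
    original (extracted) and newly generated MCQs."""
    sections = structure_textbook(textbook_text)
    originals = [align(m, sections) for m in parse_pdf(pdf_path)]
    generated = [generate_mcq(sections[0], "medium", "apply")]
    return {"original": [evaluate(m, CRITERIA) for m in originals],
            "generated": [evaluate(m, CRITERIA) for m in generated]}
```

In the study itself, each stage is an LLM agent and the human's role shifts to specifying constraints, verifying outputs, and governing the loop rather than authoring items.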

📝 Abstract
Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple-choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24-criterion quality framework. Across all evaluations, average MCQ quality was high. However, criterion-level analysis and equivalence testing show that generated MCQs are not fully comparable to expert-vetted baseline questions. Strict similarity (24/24 criteria equivalent) was never achieved. Persistent gaps were concentrated in skill depth, cognitive engagement, difficulty calibration, and metadata alignment, while surface-level qualities, such as grammar fluency, clarity of options, and absence of duplicates, were consistently strong. Beyond MCQ outcomes, the study documents a labor shift: the researcher's work moved from authoring items toward specification, orchestration, verification, and governance. Formalizing constraints, designing rubrics, building validation loops, recovering from tool failures, and auditing provenance constituted the primary activities. We discuss implications for the future of scientific work, including the emerging "AI research operations" skills required for AI-empowered research pipelines.
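The abstract's "strict similarity (24/24 criteria equivalent)" check rests on per-criterion equivalence testing. The paper's exact procedure is not given here, so the sketch below uses a standard normal-approximation two one-sided tests (TOST) procedure as one plausible instantiation; the score data, equivalence margin, and criterion keys are all hypothetical.

```python
# Hedged sketch: per-criterion equivalence via TOST (normal approximation).
# The margin, alpha, and data are illustrative assumptions, not the
# paper's reported settings.
import math
from statistics import mean, stdev

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost_equivalent(a, b, margin, alpha=0.05):
    """Two one-sided tests: are the mean criterion scores of samples
    a and b equivalent within +/- margin at level alpha?"""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    p_low = 1.0 - phi((diff + margin) / se)   # H0: diff <= -margin
    p_high = phi((diff - margin) / se)        # H0: diff >= +margin
    return max(p_low, p_high) < alpha         # reject both -> equivalent

def strictly_similar(gen_scores, base_scores, margin=0.5):
    """Strict similarity: every criterion must individually pass
    equivalence (24/24 in the paper's framework)."""
    return all(tost_equivalent(gen_scores[c], base_scores[c], margin)
               for c in gen_scores)
```

Under this reading, a single criterion with a persistent gap (e.g. difficulty calibration) is enough to defeat strict similarity, which matches the abstract's finding that 24/24 equivalence was never achieved.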
Problem

Research questions and friction points this paper is trying to address.

LLM agents
scientific research
multiple-choice question generation
AI-orchestrated workflow
question evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents
AI-orchestrated workflow
MCQ generation
scientific research automation
AI research operations