(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

📅 2026-06-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the frequent errors in AI-assisted social science research that arise when human–AI collaboration is poorly structured, particularly in hypothesis generation, specification choice, and conclusion drafting. The authors propose the Human-in-the-Loop Economic Research (HLER) framework, a decision architecture built on pre-commitment, decision sequencing, accountability, and attention allocation. Within this architecture, large language models are restricted to reasoning tasks, data processing and estimation follow deterministic procedures, and three human decision gates bind the workflow. In a pre-specified factorial experiment of 280 complete research runs across four datasets, HLER reduced the critical failure rate from 72% for an unconstrained multi-agent baseline to 16% (Fisher's exact test, p<0.001). An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register.
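The division of labour described in the summary can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the source does not name the three gates, so `hypothesis`, `specification`, and `conclusion` are hypothetical labels, the LLM steps are stubbed as plain callables, and a difference in means stands in for whatever estimator the authors actually use.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class ResearchRun:
    hypothesis: str
    specification: Optional[str] = None
    estimate: Optional[float] = None
    conclusion: Optional[str] = None
    passed_gates: List[str] = field(default_factory=list)


class GateRejected(Exception):
    """Raised when the human reviewer declines to let the run proceed."""


def human_gate(name: str, run: ResearchRun,
               approve: Callable[[str, ResearchRun], bool]) -> None:
    # The reviewer's decision is binding: a rejection stops the workflow.
    if not approve(name, run):
        raise GateRejected(name)
    run.passed_gates.append(name)


def estimate_difference(treated: List[float], control: List[float]) -> float:
    # Deterministic step: plain code, no LLM call, same inputs give same output.
    return mean(treated) - mean(control)


def run_pipeline(propose, specify, draft, approve, treated, control) -> ResearchRun:
    run = ResearchRun(hypothesis=propose())               # LLM reasoning (stub)
    human_gate("hypothesis", run, approve)                # hypothetical gate 1
    run.specification = specify(run.hypothesis)           # LLM reasoning (stub)
    human_gate("specification", run, approve)             # hypothetical gate 2
    run.estimate = estimate_difference(treated, control)  # deterministic
    run.conclusion = draft(run.hypothesis, run.estimate)  # LLM reasoning (stub)
    human_gate("conclusion", run, approve)                # hypothetical gate 3
    return run
```

In this sketch the stubbed reasoning callables only ever produce text, the numeric estimate comes from ordinary code, and a rejection at any gate prevents later steps from running.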
πŸ“ Abstract
Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p<0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.
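For readers who want to see how a Fisher's exact test on two failure rates works, here is a minimal standard-library sketch. The counts are hypothetical: the abstract reports only the percentages and the total of 280 runs, so the even split of 140 runs per arm and the failure counts of 101 and 22 (roughly 72% and 16%) are assumptions chosen for illustration, not the paper's data.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical counts (assumed, not from the paper): rows are arms,
# columns are (critical failure, no critical failure).
baseline = (101, 39)   # about 72% of an assumed 140 runs
hler = (22, 118)       # about 16% of an assumed 140 runs
p_value = fisher_exact_two_sided(*baseline, *hler)
print(p_value < 0.001)
```

With these assumed counts the p-value falls far below 0.001, in line with the reported result; the exact value depends on the real per-arm counts, which the abstract does not give.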
Problem

Research questions and friction points this paper is trying to address.

AI-assisted research
reliability
human-in-the-loop
large language models
social science
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-Loop
Large Language Models
Deterministic Computation
Research Reliability
Cognitive Labor Allocation