🤖 AI Summary
This work systematically evaluates Sakana.ai’s AI Scientist system for its autonomy in achieving Artificial General Research Intelligence (AGRI), specifically its capacity to close the hypothesis generation–experiment execution–paper writing loop.
Method: We construct an end-to-end LLM-driven research pipeline, integrating empirical benchmarking, hallucination detection, human double-blind review, and cost–time efficiency analysis.
Contribution/Results: We identify, for the first time, critical deficiencies in literature coverage, experimental success rate, and result veracity. We propose three novel paradigms: (1) an AGRI-specific evaluation benchmark; (2) a trust-aware citation verification mechanism; and (3) a standardized authorship framework. Results show the system generates human-like papers at low cost (~USD 5) and short latency (~hours), yet nearly half of its experiments fail, literature synthesis is shallow, results exhibit hallucination, and outputs often pass superficial peer review.
📝 Abstract
A major step toward Artificial General Intelligence (AGI) and superintelligence is AI's ability to autonomously conduct research, a capability we term Artificial General Research Intelligence (AGRI). If machines could generate hypotheses, conduct experiments, and write research papers without human intervention, it would transform science. Recently, Sakana.ai introduced the AI Scientist, a system claiming to automate the research lifecycle, generating both excitement and skepticism. We evaluated the AI Scientist and found it a milestone in AI-driven research. While it streamlines some aspects, it falls short of expectations. Literature reviews are weak, nearly half of the experiments fail, and manuscripts sometimes contain hallucinated results. Most notably, users must provide an experimental pipeline, limiting the AI Scientist's autonomy in research design and execution. Despite its limitations, the AI Scientist advances research automation. Many reviewers or instructors who assess work superficially may not recognize its output as AI-generated. The system produces research papers with minimal human effort and low cost. Our analysis suggests a paper costs a few US dollars and requires only a few hours of human involvement, making it significantly faster than human researchers. Compared to AI capabilities from a few years ago, this marks progress toward AGRI. The rise of AI-driven research systems requires urgent discussion within Information Retrieval (IR) and the broader scientific community. Enhancing literature retrieval, citation validation, and evaluation benchmarks could improve the reliability of AI-generated research. We propose concrete steps, including AGRI-specific benchmarks, refined peer review, and standardized attribution frameworks. Whether AGRI becomes a stepping stone to AGI depends on how the academic and AI communities shape its development.