Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work examines why large language models (LLMs) still struggle to carry out end-to-end autonomous scientific research. The authors build a pipeline of six LLM agents, each mapped to a stage of the scientific workflow, and run four end-to-end attempts to generate machine learning papers. Three attempts failed during implementation or evaluation; one completed the pipeline and was accepted at Agents4Science 2025 after passing both human and multi-AI review. From these attempts, the study identifies six recurring failure modes of LLMs in long-horizon scientific tasks and derives four design principles for building more robust AI-scientist systems. The project releases all prompts, artifacts, and outputs, giving future work on AI-driven scientific research an empirical basis to build on.

πŸ“ Abstract
We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems and the implications for autonomous scientific discovery. We release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai-scientist-artefacts-v1
Problem

Research questions and friction points this paper is trying to address.

large language models
autonomous research
scientific workflow
AI failure modes
scientific discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

autonomous scientific discovery
LLM agents
scientific workflow automation
AI failure modes
AI scientist design principles