🤖 AI Summary
To address critical challenges in LLM-driven automation of scientific experimentation—low reliability, weak methodical control, and poor interpretability—this paper proposes Curie, an AI agent framework. Methodologically, Curie introduces a collaborative architecture that integrates an *intra-agent rigor module* and an *inter-agent rigor module* with an *experiment knowledge module* to ensure rigorous end-to-end execution; it also constructs a novel experimentation benchmark of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. The design combines multi-agent coordination, structured experimental planning, causal-reasoning guidance, knowledge-graph-enhanced retrieval, and LLM self-verification. Empirically, Curie achieves a 3.4× improvement over the strongest baseline in correctly answering experimental questions. All code is publicly released.
📝 Abstract
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.