🤖 AI Summary
This study addresses the long-standing challenge of formalizing and automating “scientific taste”: the ability to judge the potential value of unverified research ideas, a bottleneck that has constrained the efficiency of scientific evaluation. The authors demonstrate for the first time that scientific taste can be learned from institutional traces such as journal acceptance decisions and encoded into a fine-tuned language model that automatically assesses the quality of research proposals. On management-science proposals, this approach achieves 59% accuracy, significantly outperforming both expert panels (42%) and eleven frontier large language models (31% on average). Notably, its highest-confidence predictions attain 100% accuracy, and the mechanism generalizes: models trained on economics publication records reach 70% accuracy, surpassing the performance limits of both human experts and existing models.
📝 Abstract
Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, covering major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI's reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.
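For readers who want a concrete picture of the mechanism, the sketch below shows one way such a setup could look: a classifier fine-tuned on pitch texts labeled with quality tiers derived from publication decisions, plus a softmax-confidence threshold that mimics the high-confidence triage described in the abstract. This is an illustrative sketch, not the authors' pipeline: the base model (`distilbert-base-uncased`, a lightweight stand-in for an LLM), the file `pitches_train.csv`, the column names, and the 0.9 threshold are all assumptions.

```python
# Sketch: fine-tune a text classifier on publication-decision tier labels,
# then triage new pitches by keeping only high-confidence predictions.
# Model choice, data files, column names, and threshold are hypothetical.
import torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

BASE = "distilbert-base-uncased"  # stand-in; the paper fine-tunes language models
TIERS = 4                         # four quality tiers, as in the benchmark

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=TIERS)

# Hypothetical CSV with columns: "pitch" (text) and "tier" (0..3, derived
# from journal acceptance outcomes).
ds = load_dataset("csv", data_files={"train": "pitches_train.csv"})["train"]
ds = ds.map(lambda ex: tok(ex["pitch"], truncation=True), batched=True)
ds = ds.rename_column("tier", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="taste-model", num_train_epochs=3),
    train_dataset=ds,
    data_collator=DataCollatorWithPadding(tok),  # pad batches to equal length
)
trainer.train()

@torch.no_grad()
def triage(pitch: str, threshold: float = 0.9):
    """Return (tier, confidence), or None if the prediction is low-confidence."""
    inputs = tok(pitch, truncation=True, return_tensors="pt")
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    conf, tier = probs.max(dim=-1)
    return (int(tier), float(conf)) if conf >= threshold else None
```

Thresholding the softmax confidence is the simplest proxy for the paper's finding that the fine-tuned model's highest-confidence predictions are the most reliable; in practice the threshold would be calibrated on held-out data.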