Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

📅 2025-09-16
🤖 AI Summary
Existing approaches extract procedural components in isolation and fail to reconstruct complete research workflows, which hampers research reproducibility and the advancement of “AI for Science.” This paper introduces an end-to-end, paper-level workflow generation framework that combines paragraph-level text mining with generative modeling to construct structured, source-locatable, visual research flowcharts from full-text papers. It uses SciBERT with Positive-Unlabeled (PU) learning to identify workflow-descriptive paragraphs, Flan-T5 with prompt learning to generate workflow phrases, and few-shot learning with ChatGPT to classify the phrases into stages, which are then mapped back to their locations in the original text. Evaluated on NLP-domain papers, the framework achieves a paragraph identification F1-score of 0.977, a ROUGE-1 of 0.454 for workflow phrase generation, and a stage classification precision of 0.958. It also supports a longitudinal analysis of methodological evolution in NLP over the past two decades, revealing a growing emphasis on data analysis and a transition from feature engineering to ablation studies.

📝 Abstract
The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping the categorized phrases to their locations in the source documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.
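To make the reported ROUGE-1 and ROUGE-2 figures easier to interpret, here is a minimal sketch of ROUGE-N computed as clipped n-gram overlap. It assumes whitespace tokenization, lowercasing, and F1 as the reported value; the abstract does not state which ROUGE variant or preprocessing the authors used, and the two example phrases are hypothetical.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Return ROUGE-N precision, recall, and F1 from clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        # Assumption: lowercase + whitespace tokenization, no stemming.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical generated phrase and gold phrase, for illustration only.
generated = "fine-tune bert on labeled data"
gold = "fine-tune bert model on annotated data"
scores = rouge_n(generated, gold, n=1)
print(round(scores["f1"], 4))  # 4 shared unigrams: P = 4/5, R = 4/6, F1 = 8/11
```

Short workflow phrases contain few n-grams, so a single differing word moves the score noticeably; that is worth keeping in mind when reading a ROUGE-1 of 0.4543.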
Problem

Research questions and friction points this paper is trying to address.

Automated generation of complete research workflows from papers
Mining full-text academic papers for structured workflow extraction
Improving research reproducibility through comprehensive workflow visualization
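As an illustration of what a structured, source-locatable workflow might look like, the sketch below stores each workflow phrase with its stage and a source locator, then renders the steps as Graphviz DOT text. The `WorkflowStep` fields, the section/paragraph locator, and the choice of DOT are assumptions made for this example; the page does not say how the authors represent or draw their flowcharts.

```python
from dataclasses import dataclass

# The three stages named in the abstract, in workflow order.
STAGES = ("data preparation", "data processing", "data analysis")


@dataclass(frozen=True)
class WorkflowStep:
    phrase: str     # generated workflow phrase
    stage: str      # one of STAGES
    section: str    # hypothetical locator: section title in the paper
    paragraph: int  # hypothetical locator: paragraph index within the section


def _esc(text: str) -> str:
    """Escape double quotes so the text is safe inside a DOT label."""
    return text.replace('"', '\\"')


def to_dot(steps) -> str:
    """Render steps as DOT: one cluster per stage, steps chained in stage order."""
    ordered = sorted(steps, key=lambda s: STAGES.index(s.stage))  # stable sort
    named = [(f"n{i}", s) for i, s in enumerate(ordered)]
    lines = ["digraph workflow {", "  rankdir=LR;", "  node [shape=box];"]
    for k, stage in enumerate(STAGES):
        lines.append(f"  subgraph cluster_{k} {{")
        lines.append(f'    label="{stage}";')
        for node_id, s in named:
            if s.stage == stage:
                # "\\n" emits a literal \n, which DOT treats as a line break.
                label = f"{_esc(s.phrase)}\\n[{_esc(s.section)}, para. {s.paragraph}]"
                lines.append(f'    {node_id} [label="{label}"];')
        lines.append("  }")
    for (a, _), (b, _) in zip(named, named[1:]):
        lines.append(f"  {a} -> {b};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical steps, for illustration only.
steps = [
    WorkflowStep("train classifier", "data processing", "Method", 2),
    WorkflowStep("collect corpus", "data preparation", "Data", 1),
    WorkflowStep("ablation study", "data analysis", "Experiments", 4),
]
dot = to_dot(steps)
print(dot)
```

The resulting text can be passed to any Graphviz renderer. Keeping the locator on each node is what lets a reader trace a box in the flowchart back to the paragraph it came from.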
Innovation

Methods, ideas, or system contributions that make the work stand out.

Full-text mining with paragraph-centric approach
PU Learning and SciBERT for workflow identification
Flan-T5 and ChatGPT for phrase generation and categorization
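The few-shot categorization step can be pictured as prompt construction plus reply parsing. The sketch below builds such a prompt and maps a free-text reply to one of the three stages; the demonstration phrases, the prompt wording, and the helper names `build_prompt` and `parse_stage` are all hypothetical, and the call to the chat model itself is deliberately left out because the authors' actual prompt and API usage are not described here.

```python
STAGES = ("data preparation", "data processing", "data analysis")

# Hypothetical demonstrations; the paper's real few-shot examples are not given here.
FEW_SHOT = [
    ("crawl papers from the ACL Anthology", "data preparation"),
    ("fine-tune SciBERT on labeled paragraphs", "data processing"),
    ("compare F1 scores across baselines", "data analysis"),
]


def build_prompt(phrase: str, examples=FEW_SHOT) -> str:
    """Assemble an instruction, labeled demonstrations, and the query phrase."""
    lines = [
        "Classify each research workflow phrase into one of: "
        + ", ".join(STAGES) + ".",
        "",
    ]
    for text, label in examples:
        lines += [f"Phrase: {text}", f"Stage: {label}", ""]
    lines += [f"Phrase: {phrase}", "Stage:"]
    return "\n".join(lines)


def parse_stage(reply: str):
    """Map a free-text model reply to a known stage, or None if nothing matches."""
    reply = reply.strip().lower()
    for stage in STAGES:
        if stage in reply:
            return stage
    return None


prompt = build_prompt("conduct ablation experiments")
print(prompt)
print(parse_stage(" Data Analysis."))
```

Returning `None` for unrecognized replies, rather than guessing a stage, keeps malformed model output from silently inflating any downstream precision figure.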
Heng Zhang
School of Information Management, Central China Normal University, Wuhan 430079, China
Chengzhi Zhang
Nanjing University of Science and Technology
Text Mining, Natural Language Processing, Science of Science