🤖 AI Summary
Galaxy's current keyword-based workflow retrieval offers limited support for semantic query interpretation and often misses relevant workflows when exact term matches are absent. To address this, we propose a task-aware two-stage retrieval framework: (1) dense vector retrieval selects candidate workflows, and (2) an instruction-tuned generative LLM (GPT-4o or Mistral-7B) reranks the candidates by semantic task alignment. To enable evaluation, we construct a benchmark dataset of Galaxy workflows with topic annotations derived via BERTopic and synthesize task-oriented queries using LLMs. We then compare lexical, dense, and reranking models on standard IR metrics, which is the first systematic evaluation of retrieval performance in the Galaxy ecosystem. Experiments show significant improvements in top-k accuracy and relevance, particularly for long or under-specified queries. The framework has been integrated into Galaxy as a prototype tool, providing a proof of concept for LLM-enhanced workflow search.
📝 Abstract
Scientific Workflow Management Systems (SWfMSs) such as Galaxy have become essential infrastructure in bioinformatics, supporting the design, execution, and sharing of complex multi-step analyses. Despite hosting hundreds of reusable workflows across domains, Galaxy's current keyword-based retrieval system offers limited support for semantic query interpretation and often fails to surface relevant workflows when exact term matches are absent. To address this gap, we propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs (GPT-4o, Mistral-7B) based on semantic task alignment. To support robust evaluation, we construct a benchmark dataset of Galaxy workflows annotated with semantic topics via BERTopic and synthesize realistic task-oriented queries using LLMs. We conduct a comprehensive comparison of lexical, dense, and reranking models using standard IR metrics, presenting the first systematic evaluation of retrieval performance in the Galaxy ecosystem. Results show that our approach significantly improves top-k accuracy and relevance, particularly for long or under-specified queries. We further integrate our system as a prototype tool within Galaxy, providing a proof-of-concept for LLM-enhanced workflow search. This work advances the usability and accessibility of scientific workflows, especially for novice users and interdisciplinary researchers.
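The retrieve-then-rerank control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words placeholder standing in for a real dense embedding model, `score_fn` is a caller-supplied function standing in for the LLM reranker (GPT-4o or Mistral-7B in the paper), and all function names, parameters, and workflow IDs are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy placeholder for a dense embedding model: a word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, workflows: Dict[str, str], k: int) -> List[str]:
    """Stage 1: return the IDs of the k workflows most similar to the query."""
    q_vec = embed(query)
    ranked = sorted(
        workflows.items(),
        key=lambda item: cosine(q_vec, embed(item[1])),
        reverse=True,
    )
    return [wf_id for wf_id, _ in ranked[:k]]


def rerank(
    query: str,
    candidates: List[str],
    workflows: Dict[str, str],
    score_fn: Callable[[str, str], float],
) -> List[str]:
    """Stage 2: reorder candidates by a task-alignment score.

    score_fn(query, description) is assumed to wrap an LLM call in a real
    system; here it is any function that returns a float.
    """
    return sorted(
        candidates,
        key=lambda wf_id: score_fn(query, workflows[wf_id]),
        reverse=True,
    )


def two_stage_search(
    query: str,
    workflows: Dict[str, str],
    score_fn: Callable[[str, str], float],
    k: int = 10,
    top_n: int = 5,
) -> List[str]:
    """Retrieve k candidates, rerank them, and keep the top_n results."""
    candidates = retrieve(query, workflows, k)
    return rerank(query, candidates, workflows, score_fn)[:top_n]
```

Keeping the reranker behind a plain callable means the expensive LLM scoring runs only on the k candidates from the first stage, which is the usual motivation for a two-stage design.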