Towards LLM-Powered Task-Aware Retrieval of Scientific Workflows for Galaxy
By: Shamse Tasnim Cynthia, Banani Roy
Potential Business Impact:
Finds the right science tools for any job.
Scientific Workflow Management Systems (SWfMSs) such as Galaxy have become essential infrastructure in bioinformatics, supporting the design, execution, and sharing of complex multi-step analyses. Despite hosting hundreds of reusable workflows across domains, Galaxy's current keyword-based retrieval system offers limited support for semantic query interpretation and often fails to surface relevant workflows when exact term matches are absent. To address this gap, we propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs (GPT-4o, Mistral-7B) based on semantic task alignment. To support robust evaluation, we construct a benchmark dataset of Galaxy workflows annotated with semantic topics via BERTopic and synthesize realistic task-oriented queries using LLMs. We conduct a comprehensive comparison of lexical, dense, and reranking models using standard IR metrics, presenting the first systematic evaluation of retrieval performance in the Galaxy ecosystem. Results show that our approach significantly improves top-k accuracy and relevance, particularly for long or under-specified queries. We further integrate our system as a prototype tool within Galaxy, providing a proof-of-concept for LLM-enhanced workflow search. This work advances the usability and accessibility of scientific workflows, especially for novice users and interdisciplinary researchers.
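The two-stage design described in the abstract (candidate retrieval followed by reranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real dense embedding model, `score_fn` is a hypothetical placeholder for a call to an instruction-tuned LLM such as GPT-4o or Mistral-7B, and the workflow IDs and descriptions are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

# Invented example corpus: workflow id -> short description (assumption, not Galaxy data).
WORKFLOWS: Dict[str, str] = {
    "wf_rnaseq": "RNA-seq differential expression analysis with read alignment and counting",
    "wf_variant": "Variant calling from whole genome sequencing reads",
    "wf_metagenomics": "Metagenomics taxonomic profiling of microbial communities",
    "wf_chipseq": "ChIP-seq peak calling and read alignment",
}


def embed(text: str) -> Counter:
    """Toy stand-in for a dense embedding model: a sparse token-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, workflows: Dict[str, str], k: int) -> List[str]:
    """Stage 1: return the ids of the k workflows most similar to the query."""
    q_vec = embed(query)
    ranked = sorted(
        workflows,
        key=lambda wid: cosine(q_vec, embed(workflows[wid])),
        reverse=True,
    )
    return ranked[:k]


def rerank(
    query: str,
    candidates: List[str],
    workflows: Dict[str, str],
    score_fn: Callable[[str, str], float],
) -> List[str]:
    """Stage 2: reorder candidates by a task-alignment score.

    score_fn(query, description) is a placeholder for an LLM relevance
    judgment; any callable returning a float can be plugged in.
    """
    return sorted(
        candidates,
        key=lambda wid: score_fn(query, workflows[wid]),
        reverse=True,
    )


def two_stage_search(
    query: str,
    workflows: Dict[str, str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
    top_n: int = 1,
) -> List[str]:
    """Retrieve k candidates, rerank them, and keep the top_n results."""
    candidates = retrieve(query, workflows, k)
    return rerank(query, candidates, workflows, score_fn)[:top_n]


if __name__ == "__main__":
    query = "find genes with differential expression from RNA-seq reads"
    # Reusing the stage-1 similarity as the reranker score, purely for demonstration.
    similarity_as_score = lambda q, d: cosine(embed(q), embed(d))
    print(two_stage_search(query, WORKFLOWS, similarity_as_score, k=3, top_n=2))
```

Keeping the reranker behind a plain callable means the expensive LLM judgment is applied only to the k candidates that survive the first stage, and it can be swapped out without touching the retrieval code.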
Similar Papers
From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics
Software Engineering
AI helps scientists build DNA analysis tools faster.
From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics
Software Engineering
Helps scientists build complex data tools easily.
Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking
Information Retrieval
Finds the best science papers for your questions.