GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

📅 2025-07-02
🤖 AI Summary
Problem: GDC users, especially novices, struggle to define patient cohorts precisely across hundreds of heterogeneous metadata fields. Method: This paper introduces a natural-language cohort construction framework powered by a locally served, open-source large language model (LLM) that translates free-text cohort descriptions into GDC cohort filters and supports iterative, interactive refinement. The system is packaged as a Docker image, hosts its model weights on Hugging Face, and exports generated cohorts back to the GDC, giving an end-to-end pipeline that runs locally. Contribution/Results: The authors release GDC Cohort Copilot, an open-source tool that lowers the barrier to building GDC cohorts. In their evaluation, the locally served GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts.

📝 Abstract
Motivation: The Genomic Data Commons (GDC) provides access to high-quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language.
Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, then exports the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts.
Availability and implementation: The standalone Docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.
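To make the idea of a "GDC cohort filter" concrete, the following is a minimal sketch of what such a filter could look like and how it could be sent to the GDC API. It assumes the filter follows the nested `op`/`content` JSON structure accepted by the `filters` parameter of the public GDC API; the helper names `build_cohort_filter` and `cases_query_url` and the example description are hypothetical and are not taken from the GDC Cohort Copilot source code. No LLM is involved here: the sketch only shows the kind of output a model would need to produce.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for case-level queries.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cohort_filter(criteria):
    """Combine {field: [values]} criteria into a single 'and' filter.

    Hypothetical helper: each criterion becomes an 'in' clause, and all
    clauses are joined under one top-level 'and' operator.
    """
    clauses = [
        {"op": "in", "content": {"field": field, "value": list(values)}}
        for field, values in criteria.items()
    ]
    return {"op": "and", "content": clauses}


def cases_query_url(cohort_filter, size=10):
    """Build (but do not send) a GET URL for the GDC cases endpoint."""
    params = {"filters": json.dumps(cohort_filter), "size": size}
    return f"{GDC_CASES_ENDPOINT}?{urlencode(params)}"


# Hypothetical description: "female patients in the TCGA breast cancer project"
example_filter = build_cohort_filter({
    "cases.project.project_id": ["TCGA-BRCA"],
    "cases.demographic.gender": ["female"],
})
example_url = cases_query_url(example_filter)
print(json.dumps(example_filter, indent=2))
print(example_url)
```

The URL is only constructed, not requested, so the sketch runs offline. Fetching it with any HTTP client would return matching cases, assuming the field names are valid in the current GDC data dictionary.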
Problem

Research questions and friction points this paper is trying to address.

Helps users find specific cohort descriptors in GDC
Converts natural language descriptions into GDC cohort filters
Provides an interactive tool for refining generated cohorts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses natural language for cohort description
Leverages a locally served, open-source LLM
Provides an interactive UI for cohort refinement
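Interactive refinement can be pictured as small edits applied to a generated filter. The sketch below is illustrative only: it assumes the filter is a flat top-level `and` over `in` clauses in the GDC API's `op`/`content` JSON style, and the helpers `set_field_values` and `remove_field` are hypothetical names, not functions from GDC Cohort Copilot.

```python
import copy


def set_field_values(cohort_filter, field, values):
    """Return a new filter in which `field` is restricted to `values`.

    Replaces the existing clause on that field if there is one,
    otherwise appends a new clause. The input filter is not modified.
    """
    refined = copy.deepcopy(cohort_filter)
    new_clause = {"op": "in", "content": {"field": field, "value": list(values)}}
    for i, clause in enumerate(refined["content"]):
        if clause["content"]["field"] == field:
            refined["content"][i] = new_clause
            break
    else:
        # No clause on this field yet, so add one.
        refined["content"].append(new_clause)
    return refined


def remove_field(cohort_filter, field):
    """Return a new filter with every clause on `field` dropped."""
    refined = copy.deepcopy(cohort_filter)
    refined["content"] = [
        c for c in refined["content"] if c["content"]["field"] != field
    ]
    return refined


# A hypothetical generated filter, then two user refinements.
generated = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id",
                                 "value": ["TCGA-BRCA"]}},
    ],
}
narrowed = set_field_values(generated, "cases.demographic.gender", ["female"])
widened = set_field_values(narrowed, "cases.project.project_id",
                           ["TCGA-BRCA", "TCGA-OV"])
```

Returning a new filter at each step, rather than mutating in place, keeps every intermediate cohort available, which is one simple way an interface could support undo.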
Steven Song
University of Chicago
machine learning for healthcare
Anirudh Subramanyam
Center for Translational Data Science, University of Chicago, Chicago, IL
Zhenyu Zhang
Center for Translational Data Science, University of Chicago, Chicago, IL
Aarti Venkat
Center for Translational Data Science, University of Chicago, Chicago, IL; Section of Biomedical Data Science, Department of Medicine, University of Chicago, Chicago, IL
Robert L. Grossman
Center for Translational Data Science, University of Chicago, Chicago, IL; Department of Computer Science, University of Chicago, Chicago, IL; Section of Biomedical Data Science, Department of Medicine, University of Chicago, Chicago, IL