---
license: apache-2.0
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---
|
<!-- differ per variant: base_model metadata, title, intro par 2, and code example --> |
|
# GDC Cohort LLM - GPT2 / User + 1M Synthetic data |
|
|
|
**GDC Cohort LLM** is a language model which translates natural language descriptions of patient cohorts from the NCI Genomic Data Commons (GDC) into the structured JSON cohort filters used by GDC for search, retrieval, and analysis of cancer genomic data. |
|
|
|
**`gdc-cohort-llm-gpt2-s1M`** is a variant of GDC Cohort LLM that fine-tunes a GPT2 model on user-derived cohort filters together with 1M synthetically sampled GDC cohort filters. This model is adapted from the pretrained weights of [`openai-community/gpt2`](https://huggingface.co/openai-community/gpt2).
|
|
|
[**GDC Cohort Copilot**](https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot) is the corresponding web app running on HuggingFace Spaces; it specifically uses the `gdc-cohort-llm-gpt2-s1M` version of GDC Cohort LLM. Full details of our model development are provided in [our paper](https://arxiv.org/abs/2507.02221) and [GitHub repo](https://github.com/uc-cdis/gdc-cohort-copilot).
|
|
|
## Model Variations |
|
|
|
| GDC Cohort LLM version | HuggingFace Link | Base Model | Training Data | Note | |
|
|-----------------------------------|-------------------------------------------------------------------------------------------------------|---------------------------|----------------------------|--------------------------------------| |
|
| GPT2 / User data | [uc-ctds/gdc-cohort-llm-gpt2-u](https://huggingface.co/uc-ctds/gdc-cohort-llm-gpt2-u) | openai-community/gpt2 | User Data | | |
|
| GPT2 / User + 100K Synthetic data | [uc-ctds/gdc-cohort-llm-gpt2-s100K](https://huggingface.co/uc-ctds/gdc-cohort-llm-gpt2-s100K) | openai-community/gpt2 | User + 100K Synthetic Data | | |
|
| GPT2 / User + 1M Synthetic data | [uc-ctds/gdc-cohort-llm-gpt2-s1M](https://huggingface.co/uc-ctds/gdc-cohort-llm-gpt2-s1M) | openai-community/gpt2 | User + 1M Synthetic Data | Deployed with **GDC Cohort Copilot** | |
|
| BART / User data | [uc-ctds/gdc-cohort-llm-bart-u](https://huggingface.co/uc-ctds/gdc-cohort-llm-bart-u) | facebook/bart-base | User Data | | |
|
| Mistral LORA / User data | [uc-ctds/gdc-cohort-llm-mistral-lora-u](https://huggingface.co/uc-ctds/gdc-cohort-llm-mistral-lora-u) | mistralai/Mistral-7B-Instruct-v0.3 | User Data | | |
|
|
|
|
|
## Getting Started with GDC Cohort LLM |
|
|
|
While GDC Cohort LLM is trained to generate structured JSON outputs, generation is greatly improved by using a structured generation framework with a JSON schema defined by a [`pydantic`](https://github.com/pydantic/pydantic/) model. We provide a lightweight [pydantic model for GDC cohort filter JSONs](https://github.com/uc-cdis/gdc-cohort-copilot/blob/master/utils/schema.py) in our GitHub repo. Using this schema and [`vLLM`](https://github.com/vllm-project/vllm) for structured generation, the model can be used as follows:
|
|
|
```python |
|
from vllm import LLM, SamplingParams |
|
from vllm.sampling_params import GuidedDecodingParams |
|
|
|
from schema import GDCCohortSchema |
|
|
|
JSON_SCHEMA = GDCCohortSchema.model_json_schema() |
|
|
|
MODEL_NAME = "uc-ctds/gdc-cohort-llm-gpt2-s1M" |
|
QUERY = "bam files from TCGA" |
|
|
|
decoding_params = GuidedDecodingParams(json=JSON_SCHEMA) |
|
sampling_params = SamplingParams( |
|
n=1, |
|
temperature=0, |
|
seed=42, |
|
max_tokens=1024, |
|
guided_decoding=decoding_params, |
|
) |
|
|
|
llm = LLM(model=MODEL_NAME) |
|
|
|
outputs = llm.generate( |
|
prompts=[QUERY], |
|
sampling_params=sampling_params, |
|
) |
|
cohort_filter = outputs[0].outputs[0].text |
|
print(cohort_filter) |
|
``` |
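
Because the generated output is the JSON cohort filter used by GDC, it can typically be submitted directly to the public GDC REST API (`https://api.gdc.cancer.gov`) to retrieve the matching records. The snippet below is a minimal illustrative sketch, not part of our released code; the endpoint, `fields`, and `size` values are example choices.

```python
import json

import requests

# Submit the cohort filter generated above to the GDC files endpoint.
response = requests.post(
    "https://api.gdc.cancer.gov/files",
    json={
        "filters": json.loads(cohort_filter),  # filter string produced by the model
        "fields": "file_id,file_name,data_format",  # example fields to return
        "size": 10,  # number of matching records to return
    },
)
response.raise_for_status()

# Print the matching file records.
for hit in response.json()["data"]["hits"]:
    print(hit)
```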
|
|
|
## Performance |
|
|
|
We demonstrate that our trained models can drastically outperform GPT-4o prompting, even when the prompt provides GPT-4o with a full data dictionary. A detailed explanation of our evaluation metrics is provided in [our paper](https://arxiv.org/abs/2507.02221).
|
|
|
| GDC Cohort LLM version | TPR | IoU | Exact | BERT | |
|
|-----------------------------------|-------|-------|-------|-------| |
|
| BART / User data | 0.117 | 0.078 | 0.028 | 0.735 | |
|
| Mistral LORA / User data | 0.124 | 0.117 | 0.092 | 0.835 | |
|
| GPT2 / User data | 0.365 | 0.331 | 0.221 | 0.819 | |
|
| GPT2 / User + 100K Synthetic data | 0.783 | 0.748 | 0.607 | 0.902 | |
|
| GPT2 / User + 1M Synthetic data | **0.855** | **0.832** | **0.702** | **0.919** | |
|
| GPT-4o (prompting w/ data dict) | 0.720 | 0.698 | 0.558 | 0.894 | |
|
|
|
## Citation
|
```bibtex |
|
@article{song2025gdc, |
|
title={GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons}, |
|
author={Song, Steven and Subramanyam, Anirudh and Zhang, Zhenyu and Venkat, Aarti and Grossman, Robert L}, |
|
journal={arXiv preprint arXiv:2507.02221}, |
|
year={2025} |
|
} |
|
``` |
|
|