---
license: apache-2.0
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---
|
<!-- differ per variant: base_model metadata, title, intro par 2, and code example --> |
|
# GDC Cohort LLM - GPT2 / User + 1M Synthetic data |
|
|
|
**GDC Cohort LLM** is a language model which translates natural language descriptions of patient cohorts from the NCI Genomic Data Commons (GDC) into the structured JSON cohort filters used by GDC for search, retrieval, and analysis of cancer genomic data. |
|
|
|
**`gdc-cohort-llm-gpt2-s1M`** is a variant of GDC Cohort LLM that fine-tunes a GPT2 model on user-derived cohort filters together with 1M synthetically sampled GDC cohort filters. This model is adapted from the pretrained weights of [`openai-community/gpt2`](https://huggingface.co/openai-community/gpt2).
|
|
|
[**GDC Cohort Copilot**](https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot) is the corresponding web app running on HuggingFace Spaces; it specifically uses the `gdc-cohort-llm-gpt2-s1M` version of GDC Cohort LLM. Full details of our model development are provided in [our paper](https://arxiv.org/abs/2507.02221) and [GitHub repo](https://github.com/uc-cdis/gdc-cohort-copilot).
|
|
|
## Model Variations |
|
|
|
| GDC Cohort LLM version | HuggingFace Link | Base Model | Training Data | Note | |
|
|-----------------------------------|-------------------------------------------------------------------------------------------------------|---------------------------|----------------------------|--------------------------------------| |
|
| GPT2 / User data | [uc-ctds/gdc-cohort-llm-gpt2-u](https://huggingface.co/uc-ctds/gdc-cohort-llm-gpt2-u) | openai-community/gpt2 | User Data | | |
|
| GPT2 / User + 100K Synthetic data | [uc-ctds/gdc-cohort-llm-gpt2-s100K](https://huggingface.co/uc-ctds/gdc-cohort-llm-gpt2-s100K) | openai-community/gpt2 | User + 100K Synthetic Data | | |
|
| GPT2 / User + 1M Synthetic data | [uc-ctds/gdc-cohort-llm-gpt2-s1M](https://huggingface.co/uc-ctds/gdc-cohort-llm-gpt2-s1M) | openai-community/gpt2 | User + 1M Synthetic Data | Deployed with **GDC Cohort Copilot** | |
|
| BART / User data | [uc-ctds/gdc-cohort-llm-bart-u](https://huggingface.co/uc-ctds/gdc-cohort-llm-bart-u) | facebook/bart-base | User Data | | |
|
| Mistral LORA / User data | [uc-ctds/gdc-cohort-llm-mistral-lora-u](https://huggingface.co/uc-ctds/gdc-cohort-llm-mistral-lora-u) | mistralai/Mistral-7B-Instruct-v0.3 | User Data | | |
|
|
|
|
|
## Getting Started with GDC Cohort LLM |
|
|
|
While GDC Cohort LLM is trained to generate structured JSON outputs, generation is greatly improved by using a structured generation framework with a JSON schema defined by a [`pydantic`](https://github.com/pydantic/pydantic/) model. We provide a lightweight [pydantic model for GDC cohort filter JSONs](https://github.com/uc-cdis/gdc-cohort-copilot/blob/master/utils/schema.py) in our GitHub repo. Using this schema and [`vLLM`](https://github.com/vllm-project/vllm) for structured generation, the model can be used as follows:
|
|
|
```python |
|
from vllm import LLM, SamplingParams |
|
from vllm.sampling_params import GuidedDecodingParams |
|
|
|
from schema import GDCCohortSchema |
|
|
|
JSON_SCHEMA = GDCCohortSchema.model_json_schema() |
|
|
|
MODEL_NAME = "uc-ctds/gdc-cohort-llm-gpt2-s1M" |
|
QUERY = "bam files from TCGA" |
|
|
|
decoding_params = GuidedDecodingParams(json=JSON_SCHEMA) |
|
sampling_params = SamplingParams( |
|
n=1, |
|
temperature=0, |
|
seed=42, |
|
max_tokens=1024, |
|
guided_decoding=decoding_params, |
|
) |
|
|
|
llm = LLM(model=MODEL_NAME) |
|
|
|
outputs = llm.generate( |
|
prompts=[QUERY], |
|
sampling_params=sampling_params, |
|
) |
|
cohort_filter = outputs[0].outputs[0].text |
|
print(cohort_filter) |
|
``` |
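
Because the generated output is the JSON cohort filter used by GDC, it can typically be submitted directly to the public GDC REST API (`https://api.gdc.cancer.gov`) to retrieve the matching records. The snippet below is a minimal illustrative sketch, not part of our released code; the endpoint, `fields`, and `size` values are example choices.

```python
import json

import requests

# Submit the cohort filter generated above to the GDC files endpoint.
response = requests.post(
    "https://api.gdc.cancer.gov/files",
    json={
        "filters": json.loads(cohort_filter),  # filter string produced by the model
        "fields": "file_id,file_name,data_format",  # example fields to return
        "size": 10,  # number of matching records to return
    },
)
response.raise_for_status()

# Print the matching file records.
for hit in response.json()["data"]["hits"]:
    print(hit)
```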
|
|
|
## Performance |
|
|
|
We demonstrate that our trained models can drastically outperform GPT-4o prompting, even when the prompt provides GPT-4o with a full data dictionary. A detailed explanation of our evaluation metrics is provided in [our paper](https://arxiv.org/abs/2507.02221).
|
|
|
| GDC Cohort LLM version | TPR | IoU | Exact | BERT | |
|
|-----------------------------------|-------|-------|-------|-------| |
|
| BART / User data | 0.117 | 0.078 | 0.028 | 0.735 | |
|
| Mistral LORA / User data | 0.124 | 0.117 | 0.092 | 0.835 | |
|
| GPT2 / User data | 0.365 | 0.331 | 0.221 | 0.819 | |
|
| GPT2 / User + 100K Synthetic data | 0.783 | 0.748 | 0.607 | 0.902 | |
|
| GPT2 / User + 1M Synthetic data | **0.855** | **0.832** | **0.702** | **0.919** | |
|
| GPT-4o (prompting w/ data dict) | 0.720 | 0.698 | 0.558 | 0.894 | |
|
|
|
## Citation
|
```bibtex |
|
@article{song2025gdc, |
|
title={GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons}, |
|
author={Song, Steven and Subramanyam, Anirudh and Zhang, Zhenyu and Venkat, Aarti and Grossman, Robert L}, |
|
journal={arXiv preprint arXiv:2507.02221}, |
|
year={2025} |
|
} |
|
``` |
|
|