---
library_name: transformers
tags:
- nucleotide-transformer
- AMR-prediction
- bioinformatics
- sequence-classification
- LoRA
---
# Model Card for DraGNOME-2.5b-v1
This model is a fine-tuned version of the Nucleotide Transformer (2.5B parameters, multi-species) for Antimicrobial Resistance (AMR) prediction, optimized for handling class imbalance and training efficiency.
## Model Details
### Model Description
This model is a fine-tuned version of InstaDeepAI's Nucleotide Transformer (2.5B parameters, multi-species) designed for binary classification of nucleotide sequences to predict Antimicrobial Resistance (AMR). It leverages LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning and includes optimizations for class imbalance and training efficiency, with checkpointing to handle Google Colab's 24-hour runtime limit. The model was trained on a dataset of positive (AMR) and negative (non-AMR) sequences.
- **Developed by:** Blaise Alako
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** alakob
- **Model type:** Sequence Classification
- **Language(s) (NLP):** Nucleotide sequences
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** InstaDeepAI/nucleotide-transformer-2.5b-multi-species
### Model Sources [optional]
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
### Direct Use
This model can be used directly for predicting whether a given nucleotide sequence is associated with Antimicrobial Resistance (AMR) without additional fine-tuning.
### Downstream Use
The model can be further fine-tuned for specific AMR-related tasks or integrated into larger bioinformatics pipelines for genomic analysis.
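As an illustration of pipeline integration, the hedged sketch below scores every record in a FASTA file. The input file name `contigs.fasta` and the use of Biopython for parsing are assumptions for illustration, not part of the released code.

```python
import torch
from Bio import SeqIO  # Biopython, assumed available in the pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")
model = AutoModelForSequenceClassification.from_pretrained("alakob/DraGNOME-2.5b-v1")
model.eval()

# Score each record in a FASTA file (hypothetical input file)
for record in SeqIO.parse("contigs.fasta", "fasta"):
    inputs = tokenizer(str(record.seq), truncation=True, max_length=1000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    label = logits.argmax(-1).item()  # 0 = non-AMR, 1 = AMR
    print(record.id, label)
```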
### Out-of-Scope Use
The model is not intended for general-purpose sequence analysis beyond AMR prediction, nor for non-biological sequence data. Misuse could include applying it to unrelated classification tasks where its training data and architecture are not applicable.
## Bias, Risks, and Limitations
The model may exhibit bias due to imbalances in the training dataset or underrepresentation of certain AMR mechanisms. It is limited by the quality and diversity of the training sequences and may not generalize well to rare or novel AMR variants.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Validation on diverse datasets and careful interpretation of predictions are recommended.
## How to Get Started with the Model
Use the code below to get started with the model:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer from the base model and the fine-tuned classification model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")
model = AutoModelForSequenceClassification.from_pretrained("alakob/DraGNOME-2.5b-v1")
model.eval()

# Example inference
sequence = "ATGC..."  # Replace with your nucleotide sequence
inputs = tokenizer(sequence, truncation=True, max_length=1000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()  # 0 = non-AMR, 1 = AMR
```
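If the repository hosts only the LoRA adapter weights rather than a fully merged checkpoint (an assumption; check the repository contents), the model can instead be loaded through PEFT's auto class, which applies the adapter on top of the base model in one call:

```python
from peft import AutoPeftModelForSequenceClassification

# Loads the base Nucleotide Transformer and attaches the LoRA adapter
model = AutoPeftModelForSequenceClassification.from_pretrained("alakob/DraGNOME-2.5b-v1")
```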
## Training Details
### Training Data
- **Negative sequences (non-AMR):**
`DSM_20231.fasta`, `ecoli-k12.fasta`, `FDA.fasta`
- **Positive sequences (AMR):**
`28227009.fasta`, `nucleotide_fasta_protein_homolog_model_variants.fasta`, `40859916.fasta`,
`nucleotide_fasta_protein_overexpression_model_variants.fasta`, `all_resfinder.fasta`,
`nucleotide_fasta_protein_variant_model_variants.fasta`, `efaecium.fasta`,
`nucleotide_fasta_rRNA_gene_variant_model_variants.fasta`
### Training Procedure
#### Preprocessing
Sequences were tokenized using the Nucleotide Transformer tokenizer with a maximum length of 1000 tokens and truncation applied where necessary.
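A minimal sketch of this preprocessing step with the `datasets` library is shown below; the column names `sequence` and `label` and the in-memory toy data are assumptions about the dataset layout, not the exact training script.

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")

def tokenize(batch):
    # Truncate sequences to the 1000-token maximum used during training
    return tokenizer(batch["sequence"], truncation=True, max_length=1000)

# Hypothetical in-memory dataset; in practice the FASTA files listed above are parsed first
raw = Dataset.from_dict({"sequence": ["ATGCGTATCG", "TTAGGCATTA"], "label": [1, 0]})
tokenized = raw.map(tokenize, batched=True)
```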
#### Training Hyperparameters
- **Training regime:** fp16 mixed precision
- **Learning rate:** 5e-5
- **Batch size:** 8 (gradient accumulation steps = 8; effective batch size 64)
- **Epochs:** 10
- **Optimizer:** AdamW (default in Hugging Face Trainer)
- **Scheduler:** Linear with 10% warmup
- **LoRA parameters:** `r=32`, `alpha=64`, `dropout=0.1`, `target_modules=["query", "value"]`
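A minimal sketch of how these settings might be expressed with `peft` and the Hugging Face `Trainer` API follows; it is reconstructed from the values listed above and is not the exact training script (the output directory is an assumption).

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, TrainingArguments

base = AutoModelForSequenceClassification.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-2.5b-multi-species", num_labels=2
)

# LoRA parameters from the list above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["query", "value"],
)
model = get_peft_model(base, lora_config)

# Hyperparameters from the list above (effective batch size 8 x 8 = 64)
training_args = TrainingArguments(
    output_dir="dragnome-2.5b-v1",  # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    num_train_epochs=10,
    fp16=True,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
)
```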
#### Speeds, Sizes, Times
Training was performed on Google Colab with checkpointing every 500 steps, retaining the last 3 checkpoints.
Exact throughput and training times depend on Colab's hardware allocation; an NVIDIA A100 GPU was used.
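A hedged sketch of the checkpointing setup described above, including how training could be resumed after a Colab disconnect; the Google Drive path is an assumption.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/dragnome-checkpoints",  # assumed Drive path
    save_steps=500,        # checkpoint every 500 steps
    save_total_limit=3,    # keep only the last 3 checkpoints
)

# trainer = Trainer(model=model, args=training_args, train_dataset=tokenized)
# trainer.train(resume_from_checkpoint=True)  # resumes from the latest checkpoint after a restart
```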
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The test set was derived from a 10% split of the DraGNOME-2.5b-v1 dataset, stratified by AMR labels.
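A minimal sketch of such a stratified hold-out split with scikit-learn; the `sequences` and `labels` lists are hypothetical placeholders for the assembled dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholders for the assembled dataset
sequences = ["ATGCGTATCG", "TTAGGCATTA", "GGATCCGTAC", "CCGTAATGCA"] * 5
labels = [1, 0, 1, 0] * 5

# 10% hold-out test split, stratified by the AMR label
train_seqs, test_seqs, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.10, stratify=labels, random_state=42
)
```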
#### Factors
Evaluation was performed across AMR and non-AMR classes.
#### Metrics
- **Accuracy:** Proportion of correct predictions
- **F1 Score:** Harmonic mean of precision and recall (primary metric)
- **Precision:** Positive predictive value
- **Recall:** Sensitivity
- **ROC-AUC:** Area under the receiver operating characteristic curve
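A hedged sketch of a `compute_metrics` callback that produces these metrics with scikit-learn; it assumes binary logits and is not necessarily the exact evaluation code used.

```python
import numpy as np
from scipy.special import softmax
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Probability of the positive (AMR) class, needed for ROC-AUC
    probs = softmax(logits, axis=-1)[:, 1]
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "roc_auc": roc_auc_score(labels, probs),
    }
```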
### Results
[More Information Needed]
#### Summary
[More Information Needed]
---
## Model Examination [optional]
[More Information Needed]
---
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** Google Colab NVIDIA A100 GPU
- **Hours used:** [More Information Needed]
- **Cloud Provider:** Google Colab
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
---
## Technical Specifications [optional]
### Model Architecture and Objective
The model uses the Nucleotide Transformer architecture (2.5B parameters) with a sequence classification head, fine-tuned with LoRA for AMR prediction.
### Compute Infrastructure
Training was performed on Google Colab with persistent storage via Google Drive.
#### Hardware
- NVIDIA A100 GPU
#### Software
- Transformers (Hugging Face)
- PyTorch
- PEFT (Parameter-Efficient Fine-Tuning)
- Weights & Biases (wandb) for logging
---
## Citation [optional]
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
---
## Glossary
- **AMR:** Antimicrobial Resistance
- **LoRA:** Low-Rank Adaptation
- **Nucleotide Transformer:** A transformer-based model for nucleotide sequence analysis
---
## More Information [optional]
[More Information Needed]
---
## Model Card Authors
Blaise Alako
---
## Model Card Contact
[More Information Needed]