---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- medical
- healthcare
- medical-feature-extraction
- clinical-nlp
- calibration
- instruction-fine-tuned
- nlp
- mistral
base_model: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
datasets:
- nbme-score-clinical-patient-notes
---

# Mistral_calibrative_few

## Model Description

This model is the few-shot, calibrative fine-tuned variant of Multi-CONFE (Confidence-Aware Medical Feature Extraction), built on [unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit). It is notably data-efficient, achieving near state-of-the-art performance while training on only 12.5% of the available data, with particular emphasis on confidence calibration and hallucination reduction.

## Intended Use

This model is designed for extracting clinically relevant features from medical patient notes with high accuracy and well-calibrated confidence scores in low-resource settings. It is particularly useful for automated assessment of medical documentation, such as USMLE Step-2 Clinical Skills notes, when training data is limited.

## Training Data

The model was trained on just 100 annotated patient notes (12.5% of the full dataset) from the [NBME - Score Clinical Patient Notes](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes) Kaggle competition, roughly 10 examples per clinical case type. The dataset contains USMLE Step-2 Clinical Skills patient notes covering 10 different clinical cases, each carrying expert annotations for the medical features to be extracted.
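
A minimal sketch of how such a subset can be drawn from the competition files; the column names follow the public NBME CSV schema, while the per-case sample size and random seed below are illustrative assumptions:

```python
import pandas as pd

# NBME competition files: patient_notes.csv maps notes to cases,
# train.csv holds the expert span annotations
notes = pd.read_csv("patient_notes.csv")  # pn_num, case_num, pn_history
train = pd.read_csv("train.csv")          # pn_num, feature_num, annotation, location

# Keep only notes that actually have expert annotations
annotated = notes[notes["pn_num"].isin(train["pn_num"])]

# ~10 annotated notes per clinical case (10 cases -> ~100 notes);
# the seed is illustrative, not the one used for this model
few_shot = annotated.groupby("case_num", group_keys=False).apply(
    lambda g: g.sample(n=min(10, len(g)), random_state=42)
)
print(len(few_shot), "notes selected")
```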

## Training Procedure

Training involved a two-phase approach:

1. **Instructive Few-Shot Fine-Tuning**: Initial alignment of the model with the medical feature extraction task, using Mistral Nemo Instruct as the base model.
2. **Calibrative Fine-Tuning**: Integration of confidence calibration mechanisms, including bidirectional feature mapping, complexity-aware confidence adjustment, and dynamic thresholding (see the illustrative sketch after this list).
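
The exact calibration formulas belong to Multi-CONFE and are not restated on this card; the following is only an illustrative sketch of the general idea, and both the penalty form and the threshold adjustment are assumptions:

```python
def adjust_confidence(raw_conf: float, feature_complexity: float,
                      base_threshold: float = 0.7) -> tuple[float, bool]:
    """Illustrative only: down-weight confidence for complex features and
    relax the acceptance threshold accordingly (dynamic thresholding).
    The actual Multi-CONFE adjustment is not reproduced here."""
    adjusted = raw_conf * (1.0 - 0.2 * feature_complexity)  # assumed penalty form
    threshold = base_threshold - 0.1 * feature_complexity   # assumed dynamic threshold
    return adjusted, adjusted >= threshold

# A moderately complex feature with high raw confidence is still accepted
conf, keep = adjust_confidence(raw_conf=0.85, feature_complexity=0.5)
print(f"adjusted confidence {conf:.2f}, keep feature: {keep}")
```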

Training hyperparameters (a configuration sketch follows the list):

- Base model: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
- LoRA rank: 32
- Training epochs: 14 (instructive phase) + 5 (calibrative phase)
- Learning rate: 2e-4 (instructive phase), 1e-4 (calibrative phase)
- Optimizer: AdamW (8-bit)
- Hallucination weight: 0.2
- Missing feature weight: 0.5
- Confidence threshold: 0.7
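
A minimal sketch of how these hyperparameters map onto a PEFT/LoRA setup for the instructive phase; the target modules, LoRA alpha, and dropout below are assumptions not stated on this card:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Load the 4-bit base model (requires bitsandbytes)
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit", device_map="auto"
)

# LoRA rank 32 as listed above; alpha, dropout, and target modules are assumed
lora = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

# Instructive phase: 14 epochs at 2e-4 with 8-bit AdamW
args = TrainingArguments(
    output_dir="mistral-confe-instructive",
    num_train_epochs=14,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",
)
```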

## Performance

On the USMLE Step-2 Clinical Skills notes dataset:

- Precision: 0.982
- Recall: 0.964
- F1 Score: 0.973

The model reaches this performance with only 12.5% of the training data used for the full model, demonstrating strong data efficiency. It reduces hallucinated features by 84.9% and missing features by 85.0% compared to the vanilla base model. This makes it particularly valuable for domains where annotated data is scarce or expensive to obtain.
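
For context, the competition-style metric is a micro-averaged F1 computed over predicted versus annotated character spans; a simplified sketch of that scoring (treating each span as a set of character indices) is:

```python
def span_f1(pred_spans, gold_spans):
    """Micro precision/recall/F1 over character indices, in the spirit of
    the NBME evaluation. Spans are (start, end) character offsets;
    simplified sketch, not the official scoring code."""
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Partial overlap earns partial credit
print(span_f1(pred_spans=[(0, 10)], gold_spans=[(5, 15)]))  # (0.5, 0.5, 0.5)
```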

## Limitations

- The model was evaluated on standardized USMLE Step-2 Clinical Skills notes and may require adaptation for other clinical domains.
- Some errors stem from knowledge gaps in specific medical terminology or from inconsistencies in annotation.
- Performance on multilingual or non-standardized clinical notes remains untested.
- While highly effective, it still performs slightly below the full-data model (F1 score 0.973 vs. 0.981).

## Ethical Considerations

Automated assessment systems must ensure fairness across different student populations. While the calibration mechanism enhances interpretability, systematic bias testing is recommended before deployment in high-stakes assessment scenarios. When using this model for educational assessment, we recommend:

1. Implementing a human-in-the-loop validation process
2. Auditing regularly for demographic parity
3. Communicating clearly to students about the use of AI in assessment

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Manal0809/Mistral_calibrative_few"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on GPU if one is available
)

# Example input
patient_note = """HPI: 35 yo F with heavy uterine bleeding. Last normal period was 6 month ago.
LMP was 2 months ago. No clots.
Changes tampon every few hours, previously 4/day. Menarche at 12.
Attempted using OCPs for menstrual regulation previously but unsuccessful.
Two adolescent children (ages unknown) at home.
Last PAP 6 months ago was normal, never abnormal.
Gained 10-15 lbs over the past few months, eating out more though.
Hyperpigmented spots on hands and LT neck that she noticed 1-2 years ago.
SH: state social worker; no smoking or drug use; beer or two on weekends;
sexually active with boyfriend of 14 months, uses condoms at first but no longer uses them."""

features_to_extract = ["35-year", "Female", "heavy-periods", "symptoms-for-6-months",
                       "Weight-Gain", "Last-menstrual-period-2-months-ago",
                       "Fatigue", "Unprotected-Sex", "Infertility"]

# Format the input as shown in the paper
input_text = f"""###instruction: Extract medical features from the patient note.
###patient_history: {patient_note}
###features: {features_to_extract}
### Annotation:"""

# Generate output (sampling must be enabled for temperature to take effect)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.2,
    num_return_sequences=1,
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
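
Because causal LMs echo the prompt, the model's extracted annotations are the text after the final `### Annotation:` marker; the exact structure of that annotation block follows the training format and may vary:

```python
# Strip the echoed prompt and keep only the generated annotation block
annotation = result.split("### Annotation:")[-1].strip()
print(annotation)
```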

## Model Card Author

Manal Abumelha - [email protected]

## Citation