---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- medical
- healthcare
- medical-feature-extraction
- clinical-nlp
- calibration
- instruction-fine-tuned
- nlp
- mistral
base_model: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
datasets:
- nbme-score-clinical-patient-notes
---

# Mistral_calibrative_few

## Model Description

This model is the few-shot, calibratively fine-tuned version of Multi-CONFE (Confidence-Aware Medical Feature Extraction), built on [unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit). It demonstrates exceptional data efficiency, achieving near state-of-the-art performance while training on only 12.5% of the available data, with particular emphasis on confidence calibration and hallucination reduction.

## Intended Use

This model is designed to extract clinically relevant features from medical patient notes with high accuracy and well-calibrated confidence scores in low-resource settings. It is particularly useful for automated assessment of medical documentation, such as USMLE Step-2 Clinical Skills notes, when training data is limited.

## Training Data

The model was trained on just 100 annotated patient notes (12.5% of the full dataset) from the [NBME - Score Clinical Patient Notes](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes) Kaggle competition dataset. This represents approximately 10 examples per clinical case type. The dataset contains USMLE Step-2 Clinical Skills patient notes covering 10 different clinical cases, with each note carrying expert annotations for multiple medical features to be extracted.

## Training Procedure

Training involved a two-phase approach:

1. **Instructive Few-Shot Fine-Tuning**: Initial alignment of the model with the medical feature extraction task, using Mistral Nemo Instruct as the base model.
2. **Calibrative Fine-Tuning**: Integration of confidence calibration mechanisms, including bidirectional feature mapping, complexity-aware confidence adjustment, and dynamic thresholding.

Training hyperparameters:

- Base model: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
- LoRA rank: 32
- Training epochs: 14 (instructive phase) + 5 (calibrative phase)
- Learning rate: 2e-4 (instructive phase), 1e-4 (calibrative phase)
- Optimizer: AdamW (8-bit)
- Hallucination weight: 0.2
- Missing feature weight: 0.5
- Confidence threshold: 0.7
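For orientation, the sketch below shows how these hyperparameters could map onto a standard `peft`/`transformers` LoRA configuration. It is an illustrative sketch, not the exact Multi-CONFE training script: the target modules, LoRA alpha and dropout, batch size, and output paths are assumptions, and the calibrative loss terms (hallucination weight, missing-feature weight, confidence threshold) require a custom loss that is only noted in comments.

```python
# Illustrative sketch only -- not the exact Multi-CONFE training script.
# Values not listed in the model card (alpha, dropout, target modules,
# batch size, output paths) are assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,                       # LoRA rank from the model card
    lora_alpha=32,              # assumption; not stated in the card
    lora_dropout=0.0,           # assumption
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

# Phase 1: instructive few-shot fine-tuning
instructive_args = TrainingArguments(
    output_dir="outputs/instructive",  # hypothetical path
    num_train_epochs=14,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",            # 8-bit AdamW
    per_device_train_batch_size=2,     # assumption
    logging_steps=10,
)

# Phase 2: calibrative fine-tuning (continues from the phase-1 adapter)
calibrative_args = TrainingArguments(
    output_dir="outputs/calibrative",  # hypothetical path
    num_train_epochs=5,
    learning_rate=1e-4,
    optim="adamw_bnb_8bit",
    per_device_train_batch_size=2,     # assumption
    logging_steps=10,
)

# The calibrative phase additionally weights hallucinated features (0.2) and
# missing features (0.5) and uses a 0.7 confidence threshold; these terms
# require a custom loss and cannot be expressed in TrainingArguments alone.
```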
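The listed confidence threshold of 0.7 suggests that features whose confidence falls below it are treated as unsupported by the note. The snippet below is a minimal, assumed illustration of such filtering over a hypothetical feature-to-confidence mapping; the bidirectional feature mapping, complexity-aware confidence adjustment, and dynamic thresholding actually used by Multi-CONFE are more involved and are not reproduced here.

```python
# Minimal, assumed illustration of confidence-threshold filtering.
# The feature/confidence values below are made up for demonstration.
CONFIDENCE_THRESHOLD = 0.7  # from the model card

extracted = {
    "35-year": 0.97,
    "Female": 0.95,
    "heavy-periods": 0.91,
    "Fatigue": 0.42,  # below threshold: treated as not supported by the note
}

accepted = {f: c for f, c in extracted.items() if c >= CONFIDENCE_THRESHOLD}
flagged_for_review = {f: c for f, c in extracted.items() if c < CONFIDENCE_THRESHOLD}

print("accepted:", accepted)
print("flagged for human review:", flagged_for_review)
```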
## Performance

On the USMLE Step-2 Clinical Skills notes dataset:

- Precision: 0.982
- Recall: 0.964
- F1 Score: 0.973

The model achieves this performance with only 12.5% of the training data used for the full model, demonstrating exceptional data efficiency. It reduces hallucinations by 84.9% and missing features by 85.0% compared to vanilla models. This makes it particularly valuable for domains where annotated data may be scarce or expensive to obtain.

## Limitations

- The model was evaluated on standardized USMLE Step-2 Clinical Skills notes and may require adaptation for other clinical domains.
- Some errors stem from knowledge gaps in specific medical terminology or from inconsistencies in the annotations.
- Performance on multilingual or non-standardized clinical notes remains untested.
- While highly effective, it still performs slightly below the full-data model (F1 score 0.973 vs. 0.981).

## Ethical Considerations

Automated assessment systems must ensure fairness across different student populations. While the calibration mechanism enhances interpretability, systematic bias testing is recommended before deployment in high-stakes assessment scenarios. When using this model for educational assessment, we recommend:

1. Implementing a human-in-the-loop validation process
2. Regular auditing for demographic parity
3. Clear communication to students about the use of AI in assessment

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Manal0809/Mistral_calibrative_few"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on GPU if one is available
)

# Example input
patient_note = """HPI: 35 yo F with heavy uterine bleeding. Last normal period was 6 month ago. LMP was 2 months ago. No clots. Changes tampon every few hours, previously 4/day. Menarche at 12. Attempted using OCPs for menstrual regulation previously but unsuccessful. Two adolescent children (ages unknown) at home. Last PAP 6 months ago was normal, never abnormal. Gained 10-15 lbs over the past few months, eating out more though. Hyperpigmented spots on hands and LT neck that she noticed 1-2 years ago. SH: state social worker; no smoking or drug use; beer or two on weekends; sexually active with boyfriend of 14 months, uses condoms at first but no longer uses them."""

features_to_extract = ["35-year", "Female", "heavy-periods", "symptoms-for-6-months", "Weight-Gain", "Last-menstrual-period-2-months-ago", "Fatigue", "Unprotected-Sex", "Infertility"]

# Format input as shown in the paper
input_text = f"""###instruction: Extract medical features from the patient note.
###patient_history: {patient_note}
###features: {features_to_extract}
### Annotation:"""

# Generate output
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature to take effect
    temperature=0.2,
    num_return_sequences=1,
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Model Card Author

Manal Abumelha - mabumelha@kku.edu.sa

## Citation