---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- medical
- healthcare
- medical-feature-extraction
- clinical-nlp
- calibration
- instruction-fine-tuned
- nlp
- mistral
base_model: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
datasets:
- nbme-score-clinical-patient-notes
---

# Mistral_calibrative_few

## Model Description

This model is the few-shot, calibratively fine-tuned version of Multi-CONFE (Confidence-Aware Medical Feature Extraction), built on [unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit). It demonstrates exceptional data efficiency, achieving near state-of-the-art performance while training on only 12.5% of the available data, with particular emphasis on confidence calibration and hallucination reduction.

## Intended Use

This model is designed to extract clinically relevant features from medical patient notes with high accuracy and well-calibrated confidence scores in low-resource settings. It is particularly useful for automated assessment of medical documentation, such as USMLE Step-2 Clinical Skills notes, when training data is limited.

## Training Data

The model was trained on just 100 annotated patient notes (12.5% of the full dataset) from the [NBME - Score Clinical Patient Notes](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes) Kaggle competition dataset. This represents approximately 10 examples per clinical case type. The dataset contains USMLE Step-2 Clinical Skills patient notes covering 10 different clinical cases, with each note carrying expert annotations for multiple medical features to be extracted.

## Training Procedure

Training involved a two-phase approach:

1. **Instructive Few-Shot Fine-Tuning**: Initial alignment of the model with the medical feature extraction task, using Mistral Nemo Instruct as the base model.
2. **Calibrative Fine-Tuning**: Integration of confidence calibration mechanisms, including bidirectional feature mapping, complexity-aware confidence adjustment, and dynamic thresholding.

Training hyperparameters:

- Base model: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
- LoRA rank: 32
- Training epochs: 14 (instructive phase) + 5 (calibrative phase)
- Learning rate: 2e-4 (instructive phase), 1e-4 (calibrative phase)
- Optimizer: AdamW (8-bit)
- Hallucination weight: 0.2
- Missing feature weight: 0.5
- Confidence threshold: 0.7
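For orientation, the sketch below shows how these hyperparameters could map onto a standard `peft`/`transformers` LoRA configuration. It is an illustrative sketch, not the exact Multi-CONFE training script: the target modules, LoRA alpha and dropout, batch size, and output paths are assumptions, and the calibrative loss terms (hallucination weight, missing-feature weight, confidence threshold) require a custom loss that is only noted in comments.

```python
# Illustrative sketch only -- not the exact Multi-CONFE training script.
# Values not listed in the model card (alpha, dropout, target modules,
# batch size, output paths) are assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,                       # LoRA rank from the model card
    lora_alpha=32,              # assumption; not stated in the card
    lora_dropout=0.0,           # assumption
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

# Phase 1: instructive few-shot fine-tuning
instructive_args = TrainingArguments(
    output_dir="outputs/instructive",  # hypothetical path
    num_train_epochs=14,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",            # 8-bit AdamW
    per_device_train_batch_size=2,     # assumption
    logging_steps=10,
)

# Phase 2: calibrative fine-tuning (continues from the phase-1 adapter)
calibrative_args = TrainingArguments(
    output_dir="outputs/calibrative",  # hypothetical path
    num_train_epochs=5,
    learning_rate=1e-4,
    optim="adamw_bnb_8bit",
    per_device_train_batch_size=2,     # assumption
    logging_steps=10,
)

# The calibrative phase additionally weights hallucinated features (0.2) and
# missing features (0.5) and uses a 0.7 confidence threshold; these terms
# require a custom loss and cannot be expressed in TrainingArguments alone.
```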
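The listed confidence threshold of 0.7 suggests that features whose confidence falls below it are treated as unsupported by the note. The snippet below is a minimal, assumed illustration of such filtering over a hypothetical feature-to-confidence mapping; the bidirectional feature mapping, complexity-aware confidence adjustment, and dynamic thresholding actually used by Multi-CONFE are more involved and are not reproduced here.

```python
# Minimal, assumed illustration of confidence-threshold filtering.
# The feature/confidence values below are made up for demonstration.
CONFIDENCE_THRESHOLD = 0.7  # from the model card

extracted = {
    "35-year": 0.97,
    "Female": 0.95,
    "heavy-periods": 0.91,
    "Fatigue": 0.42,  # below threshold: treated as not supported by the note
}

accepted = {f: c for f, c in extracted.items() if c >= CONFIDENCE_THRESHOLD}
flagged_for_review = {f: c for f, c in extracted.items() if c < CONFIDENCE_THRESHOLD}

print("accepted:", accepted)
print("flagged for human review:", flagged_for_review)
```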
## Performance

On the USMLE Step-2 Clinical Skills notes dataset:

- Precision: 0.982
- Recall: 0.964
- F1 Score: 0.973

The model achieves this performance with only 12.5% of the training data used for the full model, demonstrating exceptional data efficiency. It reduces hallucinations by 84.9% and missing features by 85.0% compared to vanilla models. This makes it particularly valuable for domains where annotated data may be scarce or expensive to obtain.

## Limitations

- The model was evaluated on standardized USMLE Step-2 Clinical Skills notes and may require adaptation for other clinical domains.
- Some errors stem from knowledge gaps in specific medical terminology or from inconsistencies in the annotations.
- Performance on multilingual or non-standardized clinical notes remains untested.
- While highly effective, it still performs slightly below the full-data model (F1 score 0.973 vs. 0.981).

## Ethical Considerations

Automated assessment systems must ensure fairness across different student populations. While the calibration mechanism enhances interpretability, systematic bias testing is recommended before deployment in high-stakes assessment scenarios. When using this model for educational assessment, we recommend:

1. Implementing a human-in-the-loop validation process
2. Regular auditing for demographic parity
3. Clear communication to students about the use of AI in assessment

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Manal0809/Mistral_calibrative_few"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on GPU if one is available
)

# Example input
patient_note = """HPI: 35 yo F with heavy uterine bleeding. Last normal period was 6 month ago. LMP was 2 months ago. No clots. Changes tampon every few hours, previously 4/day. Menarche at 12. Attempted using OCPs for menstrual regulation previously but unsuccessful. Two adolescent children (ages unknown) at home. Last PAP 6 months ago was normal, never abnormal. Gained 10-15 lbs over the past few months, eating out more though. Hyperpigmented spots on hands and LT neck that she noticed 1-2 years ago. SH: state social worker; no smoking or drug use; beer or two on weekends; sexually active with boyfriend of 14 months, uses condoms at first but no longer uses them."""

features_to_extract = ["35-year", "Female", "heavy-periods", "symptoms-for-6-months", "Weight-Gain", "Last-menstrual-period-2-months-ago", "Fatigue", "Unprotected-Sex", "Infertility"]

# Format input as shown in the paper
input_text = f"""###instruction: Extract medical features from the patient note.
###patient_history: {patient_note}
###features: {features_to_extract}
### Annotation:"""

# Generate output
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature to take effect
    temperature=0.2,
    num_return_sequences=1,
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Model Card Author

Manal Abumelha - mabumelha@kku.edu.sa

## Citation