Upload MedLLM-10M medical language model

Files changed (9)

README.md +169 -3
config.json +33 -0
demo.py +28 -0
generation_config.json +7 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +53 -0
training_config.yaml +39 -0

README.md CHANGED

@@ -1,3 +1,169 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language: en
+tags:
+- medical
+- healthcare
+- gpt
+- text-generation
+- clinical
+- biology
+- medicine
+datasets:
+- medical-literature
+- pubmed
+widget:
+- text: "Symptoms of diabetes include"
+  example_title: "Medical Symptoms"
+- text: "Treatment for hypertension involves"
+  example_title: "Medical Treatment"
+- text: "The patient presents with chest pain and"
+  example_title: "Clinical Note"
+- text: "Question: What is high blood pressure? Answer:"
+  example_title: "Medical Q&A"
+pipeline_tag: text-generation
+---
+# MedLLM-10M: Medical Language Model
+## Model Description
+MedLLM-10M is a lightweight GPT-style language model trained on medical literature and clinical text. It is intended for educational and research use in the medical domain.
+⚠️ **Important Disclaimer**: This model is for research and educational purposes only. It should never be used for actual medical diagnosis, treatment recommendations, or clinical decision-making without proper medical supervision.
+## Model Details
+- **Model Type**: Causal Language Model (GPT-style)
+- **Parameters**: ~27.7M
+- **Architecture**: Transformer decoder
+- **Training Data**: Medical literature, PubMed abstracts, clinical guidelines
+- **Vocabulary Size**: 5,000
+- **Context Length**: 512 tokens
+- **License**: Apache 2.0
+## Architecture
+```
+Layers: 8
+Hidden Size: 512
+Attention Heads: 8
+Feed Forward Size: 2048
+Dropout: 0.1
+Activation: gelu
+```
+## Training Details
+The model was trained on a curated dataset of medical literature including:
+- PubMed abstracts and research papers
+- Medical journal articles
+- Clinical practice guidelines
+- Medical Q&A datasets
+- Healthcare websites (Mayo Clinic, WebMD, etc.)
+### Training Hyperparameters
+- **Epochs**: 10
+- **Batch Size**: 4 per step (effective batch size 32 with 8 gradient accumulation steps)
+- **Learning Rate**: 0.0003
+- **Optimizer**: AdamW
+- **Weight Decay**: 0.01
+- **Mixed Precision**: FP16 (if available)
+### Hardware
+- **Training Hardware**: NVIDIA RTX 3060 (12GB VRAM)
+- **Framework**: PyTorch + Transformers
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("raihan-js/medllm-10m")
+model = AutoModelForCausalLM.from_pretrained("raihan-js/medllm-10m")
+# Generate medical text
+prompt = "Symptoms of diabetes include"
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(
+    **inputs,
+    max_length=100,
+    do_sample=True,
+    temperature=0.7,
+    pad_token_id=tokenizer.eos_token_id
+)
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)
+```
+## Model Performance
+This is an early-stage model trained on limited data. Current capabilities include:
+- Basic medical terminology understanding
+- Simple text completion in medical contexts
+- Educational content generation
+**Known Limitations**:
+- May generate incoherent or medically inaccurate text
+- Requires significant additional training for production use
+- Should not be used for medical advice or diagnosis
+## Intended Use Cases
+### ✅ Appropriate Uses
+- Educational demonstrations of medical language models
+- Research into medical NLP applications
+- Text completion for medical writing assistance (with human review)
+- Learning and experimentation with transformer models
+### ❌ Inappropriate Uses
+- **Medical diagnosis or treatment recommendations**
+- **Clinical decision-making**
+- **Patient care without human oversight**
+- **Emergency medical situations**
+- **Replacement for professional medical advice**
+## Ethical Considerations
+### Medical Disclaimer
+⚠️ **CRITICAL WARNING**: This model is NOT intended for medical use. Always consult qualified healthcare professionals for medical advice, diagnosis, or treatment.
+### Limitations and Biases
+- Training data may contain biases present in medical literature
+- Model may reflect historical or cultural biases in healthcare
+- Performance varies significantly across different medical specialties
+- May generate plausible but medically incorrect information
+## Development Status
+This is an **experimental model** in early development. Planned improvements:
+- Expanded training dataset
+- Longer training duration
+- Better medical accuracy evaluation
+- Safety filtering and alignment
+- Domain-specific fine-tuning
+## Citation
+```bibtex
+@misc{medllm2024,
+  title={MedLLM: A Lightweight Medical Language Model},
+  author={Raihan},
+  year={2024},
+  publisher={HuggingFace},
+  url={https://huggingface.co/raihan-js/medllm-10m}
+}
+```
+## Contact
+For questions about this model, please open an issue in the model repository.
+---
+**Last Updated**: December 2024
+**Model Version**: 1.0-alpha
+**Status**: Experimental - Not for production use

config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "activation_function": "gelu",
+  "architectures": [
+    "GPT2LMHeadModel"
+  ],
+  "attn_pdrop": 0.1,
+  "bos_token_id": 1,
+  "embd_pdrop": 0.1,
+  "eos_token_id": 2,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "model_type": "gpt2",
+  "n_embd": 512,
+  "n_head": 8,
+  "n_inner": 2048,
+  "n_layer": 8,
+  "n_positions": 512,
+  "pad_token_id": 0,
+  "reorder_and_upcast_attn": false,
+  "resid_pdrop": 0.1,
+  "scale_attn_by_inverse_layer_idx": false,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.1,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.55.0",
+  "unk_token_id": 3,
+  "use_cache": true,
+  "vocab_size": 5000
+}
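
Note on parameter count: the README states ~27.7M parameters while the model name says 10M. The sketch below recomputes the count from the values in `config.json`, assuming the standard `GPT2LMHeadModel` layout (fused QKV projection, biases on every linear layer, output head tied to the token embedding) and float32 storage. The function name `count_gpt2_params` is illustrative and not part of this commit.

```python
def count_gpt2_params(vocab_size, n_positions, n_embd, n_layer, n_inner):
    """Count weights in a standard GPT-2 stack with tied input/output embeddings."""
    token_emb = vocab_size * n_embd          # wte (also reused as the LM head)
    pos_emb = n_positions * n_embd           # wpe
    layer_norm = 2 * n_embd                  # scale + shift
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: up-projection and down-projection, each with bias.
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    # n_layer blocks plus the final layer norm.
    return token_emb + pos_emb + n_layer * block + layer_norm


# Values taken from config.json in this commit.
total = count_gpt2_params(
    vocab_size=5000, n_positions=512, n_embd=512, n_layer=8, n_inner=2048
)
without_positions = total - 512 * 512   # some tools exclude position embeddings
fp32_bytes = total * 4                  # torch_dtype is float32

print(total)              # 28042240
print(without_positions)  # 27780096
print(fp32_bytes)         # 112168960
```

Under these assumptions the total is about 28.0M, or about 27.8M if position embeddings are excluded, which is in the neighborhood of the README figure and well above the 10M in the name. The float32 size is 9,960 bytes smaller than the `model.safetensors` size recorded in the LFS pointer (112,178,920 bytes); that gap is plausibly the safetensors header.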

demo.py ADDED

	@@ -0,0 +1,28 @@

+# demo.py - Quick demo of the model
+from transformers import AutoTokenizer, AutoModelForCausalLM
+model_name = "raihan-js/medllm-10m"
+print("Loading MedLLM...")
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name)
+prompts = [
+    "Symptoms of diabetes include",
+    "Treatment for high blood pressure",
+    "The patient presents with"
+]
+print("\nGenerating medical text:")
+for prompt in prompts:
+    inputs = tokenizer(prompt, return_tensors="pt")
+    outputs = model.generate(
+        **inputs,
+        max_length=50,
+        do_sample=True,
+        temperature=0.7,
+        pad_token_id=tokenizer.eos_token_id
+    )
+    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    print(f"\nPrompt: {prompt}")
+    print(f"Response: {response}")

generation_config.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 0,
+  "transformers_version": "4.55.0"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8c0921bd52a1e086e165e22e2abeea2b79804ea827700fd9003ee5f31168aba8
+size 112178920

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "bos_token": "<pad>",
+  "eos_token": "</s>",
+  "mask_token": "<mask>",
+  "pad_token": "<s>",
+  "unk_token": "<unk>"
+}
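
Note on special tokens: `bos_token` is set to the string `<pad>` and `pad_token` to the string `<s>`, which reads as swapped. The sketch below checks whether the token ids still line up with `config.json`, using only values copied from this commit (`added_tokens_decoder` in `tokenizer_config.json`, this file, and `config.json`). The helper `resolve_ids` and the `conventional` mapping are illustrative assumptions, not part of the repository.

```python
# id -> token string, from added_tokens_decoder in tokenizer_config.json
id_to_token = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>", 4: "<mask>"}

# role -> token string, from special_tokens_map.json
special_tokens = {
    "bos_token": "<pad>",
    "eos_token": "</s>",
    "pad_token": "<s>",
    "unk_token": "<unk>",
}

# role ids, from config.json
config_ids = {"bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 0, "unk_token_id": 3}

# Assumed conventional spelling for each role (not stated anywhere in the commit).
conventional = {"bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>", "unk_token": "<unk>"}

token_to_id = {token: idx for idx, token in id_to_token.items()}


def resolve_ids(roles, vocab):
    """Map each role (e.g. bos_token) to the id of the string assigned to it."""
    return {f"{role}_id": vocab[token] for role, token in roles.items()}


resolved = resolve_ids(special_tokens, token_to_id)

# Roles whose resolved id disagrees with config.json.
mismatches = {k: (resolved[k], v) for k, v in config_ids.items() if resolved[k] != v}

# Roles whose token string differs from the assumed conventional spelling.
swapped = sorted(r for r, tok in special_tokens.items() if tok != conventional[r])

print(resolved)
print(mismatches)
print(swapped)
```

With these values the ids agree with `config.json` (BOS is id 1, PAD is id 0), so the files are internally consistent. Only the labels are unusual: the token spelled `<pad>` acts as BOS and the token spelled `<s>` acts as PAD, which is worth keeping in mind when reading decoded output with `skip_special_tokens=False`.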

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,53 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<mask>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<pad>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<s>",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<unk>"
+}
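
Note on `model_max_length`: the value 1000000000000000019884624838656 equals `int(1e30)`, which Transformers writes as a placeholder when no maximum length has been configured, while `config.json` sets `n_positions` to 512. The tokenizer therefore will not truncate long inputs on its own. The sketch below shows one way a caller might derive a safe limit; the helper names `usable_max_length` and `new_token_budget` are hypothetical.

```python
VERY_LARGE = int(1e30)  # placeholder value seen in tokenizer_config.json


def usable_max_length(tokenizer_max_length, n_positions):
    """Return the largest sequence length the model can actually accept."""
    return min(tokenizer_max_length, n_positions)


def new_token_budget(prompt_length, n_positions):
    """Tokens that can still be generated before the context window is full."""
    return max(n_positions - prompt_length, 0)


limit = usable_max_length(VERY_LARGE, n_positions=512)
budget = new_token_budget(prompt_length=40, n_positions=512)

print(limit)   # 512
print(budget)  # 472

# Usage with the tokenizer from this repository (not executed here):
#   inputs = tokenizer(text, truncation=True, max_length=limit, return_tensors="pt")
```

Passing `truncation=True` together with an explicit `max_length` keeps prompts within the 512 learned positions even though the tokenizer config does not enforce it.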

training_config.yaml ADDED

	@@ -0,0 +1,39 @@

+data:
+  max_length: 512
+  min_doc_length: 100
+  stride: 256
+huggingface:
+  license: apache-2.0
+  model_name: raihan-js/medllm-10m-v2
+  private: false
+model:
+  activation: gelu
+  d_ff: 2048
+  d_model: 512
+  dropout: 0.1
+  max_seq_len: 512
+  n_heads: 8
+  n_layers: 8
+  name: MedLLM-10M-v2
+  vocab_size: 5000
+paths:
+  data_dir: ./data
+  logs_dir: ./logs
+  model_dir: ./checkpoints/medllm-10m
+  tokenizer_dir: ./tokenizer/vocab
+scraping:
+  delay_between_requests: 0.5
+  max_retries: 3
+  max_workers: 8
+  timeout: 30
+training:
+  batch_size: 4
+  eval_steps: 50
+  fp16: true
+  grad_clip: 1.0
+  gradient_accumulation_steps: 8
+  learning_rate: 0.0003
+  num_epochs: 10
+  save_steps: 100
+  warmup_steps: 200
+  weight_decay: 0.01
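
Note on the training settings: a minimal sketch of what the `training` and `data` values above imply, assuming `gradient_accumulation_steps` multiplies the per-step batch and that `stride` is the step between the starts of consecutive 512-token windows. The training script is not part of this commit, so the `windows` helper is an illustration rather than the author's implementation. Because the stride is half of `max_length`, reading `stride` as the overlap between windows would yield the same spans.

```python
batch_size = 4
gradient_accumulation_steps = 8
max_length = 512
stride = 256

# Sequences contributing to each optimizer update.
effective_batch = batch_size * gradient_accumulation_steps
# Upper bound on tokens per update, if every sequence is full length.
tokens_per_update = effective_batch * max_length


def windows(n_tokens, max_length, stride):
    """Return (start, end) offsets of fixed-size windows over a token sequence."""
    if n_tokens <= max_length:
        return [(0, n_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


print(effective_batch)            # 32
print(tokens_per_update)          # 16384
print(windows(1200, max_length, stride))
# [(0, 512), (256, 768), (512, 1024), (768, 1200)]
```

A 1,200-token document would thus be seen as four overlapping training examples, with most tokens appearing in two windows.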