raihan-js committed
Commit 9afef3b · verified · 1 Parent(s): a575bb1

Upload MedLLM-10M medical language model

README.md CHANGED
@@ -1,3 +1,169 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language: en
+ tags:
+ - medical
+ - healthcare
+ - gpt
+ - text-generation
+ - clinical
+ - biology
+ - medicine
+ datasets:
+ - medical-literature
+ - pubmed
+ widget:
+ - text: "Symptoms of diabetes include"
+   example_title: "Medical Symptoms"
+ - text: "Treatment for hypertension involves"
+   example_title: "Medical Treatment"
+ - text: "The patient presents with chest pain and"
+   example_title: "Clinical Note"
+ - text: "Question: What is high blood pressure? Answer:"
+   example_title: "Medical Q&A"
+ pipeline_tag: text-generation
+ ---
+
+ # MedLLM-10M: Medical Language Model
+
+ ## Model Description
+
+ MedLLM-10M is a lightweight GPT-style language model specifically trained on medical literature and clinical text. This model is designed for educational and research purposes in the medical domain.
+
+ ⚠️ **Important Disclaimer**: This model is for research and educational purposes only. It should never be used for actual medical diagnosis, treatment recommendations, or clinical decision-making without proper medical supervision.
+
+ ## Model Details
+
+ - **Model Type**: Causal Language Model (GPT-style)
+ - **Parameters**: ~27.7M
+ - **Architecture**: Transformer decoder
+ - **Training Data**: Medical literature, PubMed abstracts, clinical guidelines
+ - **Vocabulary Size**: 5,000
+ - **Context Length**: 512 tokens
+ - **License**: Apache 2.0
+
+ ## Architecture
+
+ ```
+ Layers: 8
+ Hidden Size: 512
+ Attention Heads: 8
+ Feed Forward Size: 2048
+ Dropout: 0.1
+ Activation: gelu
+ ```
+
+ ## Training Details
+
+ The model was trained on a curated dataset of medical literature including:
+ - PubMed abstracts and research papers
+ - Medical journal articles
+ - Clinical practice guidelines
+ - Medical Q&A datasets
+ - Healthcare websites (Mayo Clinic, WebMD, etc.)
+
+ ### Training Hyperparameters
+
+ - **Epochs**: 10
+ - **Batch Size**: 4
+ - **Learning Rate**: 0.0003
+ - **Optimizer**: AdamW
+ - **Weight Decay**: 0.01
+ - **Mixed Precision**: FP16 (if available)
+
+ ### Hardware
+
+ - **Training Hardware**: NVIDIA RTX 3060 (12GB VRAM)
+ - **Framework**: PyTorch + Transformers
79
+ ## Usage
80
+
81
+ ```python
82
+ from transformers import AutoTokenizer, AutoModelForCausalLM
83
+
84
+ # Load model and tokenizer
85
+ tokenizer = AutoTokenizer.from_pretrained("raihan-js/medllm-10m")
86
+ model = AutoModelForCausalLM.from_pretrained("raihan-js/medllm-10m")
87
+
88
+ # Generate medical text
89
+ prompt = "Symptoms of diabetes include"
90
+ inputs = tokenizer(prompt, return_tensors="pt")
91
+ outputs = model.generate(
92
+ **inputs,
93
+ max_length=100,
94
+ do_sample=True,
95
+ temperature=0.7,
96
+ pad_token_id=tokenizer.eos_token_id
97
+ )
98
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
99
+ print(response)
100
+ ```
+
+ ## Model Performance
+
+ This is an early-stage model trained on limited data. Current capabilities include:
+ - Basic medical terminology understanding
+ - Simple text completion in medical contexts
+ - Educational content generation
+
+ **Known Limitations**:
+ - May generate incoherent or medically inaccurate text
+ - Requires significant additional training for production use
+ - Should not be used for medical advice or diagnosis
+
+ ## Intended Use Cases
+
+ ### ✅ Appropriate Uses
+ - Educational demonstrations of medical language models
+ - Research into medical NLP applications
+ - Text completion for medical writing assistance (with human review)
+ - Learning and experimentation with transformer models
+
+ ### ❌ Inappropriate Uses
+ - **Medical diagnosis or treatment recommendations**
+ - **Clinical decision-making**
+ - **Patient care without human oversight**
+ - **Emergency medical situations**
+ - **Replacement for professional medical advice**
+
+ ## Ethical Considerations
+
+ ### Medical Disclaimer
+ ⚠️ **CRITICAL WARNING**: This model is NOT intended for medical use. Always consult qualified healthcare professionals for medical advice, diagnosis, or treatment.
+
+ ### Limitations and Biases
+ - Training data may contain biases present in medical literature
+ - Model may reflect historical or cultural biases in healthcare
+ - Performance varies significantly across different medical specialties
+ - May generate plausible but medically incorrect information
+
+ ## Development Status
+
+ This is an **experimental model** in early development. Future improvements planned:
+ - Expanded training dataset
+ - Longer training duration
+ - Better medical accuracy evaluation
+ - Safety filtering and alignment
+ - Domain-specific fine-tuning
+
+ ## Citation
+
+ ```bibtex
+ @misc{medllm2024,
+   title={MedLLM: A Lightweight Medical Language Model},
+   author={Raihan},
+   year={2024},
+   publisher={HuggingFace},
+   url={https://huggingface.co/raihan-js/medllm-10m}
+ }
+ ```
+
+ ## Contact
+
+ For questions about this model, please open an issue in the model repository.
+
+ ---
+
+ **Last Updated**: December 2024
+ **Model Version**: 1.0-alpha
+ **Status**: Experimental - Not for production use
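The Usage snippet in the README passes `max_length=100`. In `generate`, `max_length` counts the prompt tokens as well as the generated ones, and the README gives the context length as 512 tokens. A minimal sketch of budgeting new tokens separately follows; the helper name `new_token_budget` is hypothetical and not part of this commit.

```python
# Hypothetical helper (not in the commit): cap the number of newly generated
# tokens so that prompt + continuation stays within the model's context window.
CONTEXT_LENGTH = 512  # "Context Length" in the README, n_positions in config.json


def new_token_budget(prompt_tokens: int, requested: int, context: int = CONTEXT_LENGTH) -> int:
    """Return how many new tokens fit after a prompt of `prompt_tokens` tokens."""
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = max(context - prompt_tokens, 0)
    return min(requested, remaining)


# A short prompt leaves the full request available; a long one is clamped.
print(new_token_budget(prompt_tokens=6, requested=100))
print(new_token_budget(prompt_tokens=500, requested=100))
```

The result could be passed to `model.generate` as `max_new_tokens=` in place of `max_length=`, which keeps the generated length independent of the prompt length.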
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "activation_function": "gelu",
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 1,
+   "embd_pdrop": 0.1,
+   "eos_token_id": 2,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_embd": 512,
+   "n_head": 8,
+   "n_inner": 2048,
+   "n_layer": 8,
+   "n_positions": 512,
+   "pad_token_id": 0,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.1,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.0",
+   "unk_token_id": 3,
+   "use_cache": true,
+   "vocab_size": 5000
+ }
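The values in config.json are enough to estimate the parameter count by hand. The sketch below assumes the standard GPT-2 layout (learned token and position embeddings, biased attention and MLP projections, two LayerNorms per block plus a final one) and an output head tied to the token embedding; the function name `gpt2_param_count` is illustrative only.

```python
# Illustrative estimate, assuming the standard GPT-2 layout with a tied LM head.
def gpt2_param_count(cfg: dict) -> int:
    d, inner = cfg["n_embd"], cfg["n_inner"]
    # Token embeddings plus learned position embeddings.
    embeddings = cfg["vocab_size"] * d + cfg["n_positions"] * d
    # Fused QKV projection and attention output projection (weights + biases).
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # Two feed-forward projections (weights + biases).
    mlp = (d * inner + inner) + (inner * d + d)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d)
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d
    return embeddings + cfg["n_layer"] * per_block + final_norm


config = {"n_embd": 512, "n_inner": 2048, "n_layer": 8,
          "n_positions": 512, "vocab_size": 5000}
total = gpt2_param_count(config)
print(f"{total:,} parameters ({total / 1e6:.1f}M)")  # 28,042,240 parameters (28.0M)
```

Under these assumptions the estimate lands near 28.0M, slightly above the ~27.7M the README states and well above the 10M in the model name.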
demo.py ADDED
@@ -0,0 +1,28 @@
+ # demo.py - Quick demo of the model
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_name = "raihan-js/medllm-10m"
+
+ print("Loading MedLLM...")
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ prompts = [
+     "Symptoms of diabetes include",
+     "Treatment for high blood pressure",
+     "The patient presents with"
+ ]
+
+ print("\nGenerating medical text:")
+ for prompt in prompts:
+     inputs = tokenizer(prompt, return_tensors="pt")
+     outputs = model.generate(
+         **inputs,
+         max_length=50,
+         do_sample=True,
+         temperature=0.7,
+         pad_token_id=tokenizer.eos_token_id
+     )
+     response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+     print(f"\nPrompt: {prompt}")
+     print(f"Response: {response}")
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "transformers_version": "4.55.0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c0921bd52a1e086e165e22e2abeea2b79804ea827700fd9003ee5f31168aba8
+ size 112178920
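The three lines above are a Git LFS pointer rather than the weights themselves. As a rough sanity check, the `size` field can be divided by four bytes per value, since config.json declares `torch_dtype` as `float32`. The sketch below parses the pointer text and computes that figure; `parse_lfs_pointer` is a hypothetical helper, and the result is only an upper bound because a safetensors file also carries a JSON header.

```python
# Pointer text copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:8c0921bd52a1e086e165e22e2abeea2b79804ea827700fd9003ee5f31168aba8
size 112178920
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER)
size_bytes = int(pointer["size"])
BYTES_PER_FLOAT32 = 4
# Upper bound: part of the file is the safetensors header, not tensor data.
max_values = size_bytes // BYTES_PER_FLOAT32
print(f"{size_bytes:,} bytes -> at most {max_values:,} float32 values")
```

That bound is about 28.0 million values, which is in the same range as the README's stated ~27.7M parameters and not the 10M suggested by the model name.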
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "bos_token": "<pad>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "pad_token": "<s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<mask>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<pad>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<s>",
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "unk_token": "<unk>"
+ }
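The special-token names here look swapped at first glance (`<pad>` serves as the BOS token and `<s>` as the padding token), so it is worth checking that the literals still resolve to the IDs config.json expects. The sketch below copies the relevant values from the three files in this commit into plain dictionaries; the variable and function names are illustrative.

```python
# Values copied from tokenizer_config.json, special_tokens_map.json and config.json.
added_tokens_decoder = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>", 4: "<mask>"}
special_tokens = {
    "bos_token": "<pad>",
    "eos_token": "</s>",
    "mask_token": "<mask>",
    "pad_token": "<s>",
    "unk_token": "<unk>",
}
config_ids = {"bos_token": 1, "eos_token": 2, "pad_token": 0, "unk_token": 3}

# Invert the decoder table so each token literal maps back to its ID.
token_to_id = {token: token_id for token_id, token in added_tokens_decoder.items()}


def check_special_tokens() -> dict:
    """For each role in config.json, report (literal, resolved ID, matches config)."""
    report = {}
    for role, expected_id in config_ids.items():
        literal = special_tokens[role]
        resolved = token_to_id[literal]
        report[role] = (literal, resolved, resolved == expected_id)
    return report


for role, (literal, token_id, ok) in check_special_tokens().items():
    print(f"{role:10s} {literal:7s} id={token_id} matches_config={ok}")
```

Every role resolves to the ID that config.json lists, so the files agree with one another even though the literal names are unconventional.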
training_config.yaml ADDED
@@ -0,0 +1,39 @@
+ data:
+   max_length: 512
+   min_doc_length: 100
+   stride: 256
+ huggingface:
+   license: apache-2.0
+   model_name: raihan-js/medllm-10m-v2
+   private: false
+ model:
+   activation: gelu
+   d_ff: 2048
+   d_model: 512
+   dropout: 0.1
+   max_seq_len: 512
+   n_heads: 8
+   n_layers: 8
+   name: MedLLM-10M-v2
+   vocab_size: 5000
+ paths:
+   data_dir: ./data
+   logs_dir: ./logs
+   model_dir: ./checkpoints/medllm-10m
+   tokenizer_dir: ./tokenizer/vocab
+ scraping:
+   delay_between_requests: 0.5
+   max_retries: 3
+   max_workers: 8
+   timeout: 30
+ training:
+   batch_size: 4
+   eval_steps: 50
+   fp16: true
+   grad_clip: 1.0
+   gradient_accumulation_steps: 8
+   learning_rate: 0.0003
+   num_epochs: 10
+   save_steps: 100
+   warmup_steps: 200
+   weight_decay: 0.01
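The training script itself is not part of this commit, so how it consumes `data.max_length` and `data.stride` is not shown. One plausible reading is a sliding window over each tokenized document, sketched below with `stride` taken as the step between window starts (with 512 and 256 this coincides with reading it as the overlap). The sketch also works out the effective batch size implied by `batch_size` and `gradient_accumulation_steps`. The function name `sliding_windows` is an assumption.

```python
# Assumed interpretation of the data section: fixed-size windows of max_length
# tokens, each starting `stride` tokens after the previous one.
def sliding_windows(token_ids: list, max_length: int = 512, stride: int = 256) -> list:
    if max_length <= 0 or stride <= 0:
        raise ValueError("max_length and stride must be positive")
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_length])
        # Stop once a window reaches the end of the document.
        if start + max_length >= len(token_ids):
            break
    return windows


document = list(range(1200))  # stand-in for 1,200 token IDs
windows = sliding_windows(document)
print([(w[0], len(w)) for w in windows])  # [(0, 512), (256, 512), (512, 512), (768, 432)]

# Gradients are accumulated over 8 micro-batches of 4 sequences before each update.
batch_size, gradient_accumulation_steps = 4, 8
effective_batch = batch_size * gradient_accumulation_steps
print(f"effective batch size: {effective_batch} sequences per optimizer step")
```

With gradient accumulation the optimizer sees 32 sequences per step, even though the README lists the batch size as 4.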