Amirhossein75 committed 29641e2 (verified; parent: ee0cde6)

Create README.md
---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
language:
- en
library_name: transformers
pipeline_tag: text-classification
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- multi-label
- emotion-detection
- reddit
- go_emotions
- pytorch
- huggingface
- peft
- accelerate
datasets:
- go_emotions
license: other
model-index:
- name: multi-label-emotion-classification-reddit-comments (RoBERTa-base on GoEmotions)
  results:
  - task:
      name: Text Classification (multi-label emotions)
      type: text-classification
    dataset:
      name: GoEmotions
      type: go_emotions
      config: simplified
      split: test
    metrics:
    - name: F1 (micro)
      type: f1
      value: 0.5284209017274747
      args:
        average: micro
        threshold: 0.84
    - name: F1 (macro)
      type: f1
      value: 0.49954895970228047
      args:
        average: macro
        threshold: 0.84
    - name: F1 (samples)
      type: f1
      value: 0.5301482007949669
      args:
        average: samples
        threshold: 0.84
    - name: Average Precision (micro)
      type: average_precision
      value: 0.5351637127240974
      args:
        average: micro
    - name: Average Precision (macro)
      type: average_precision
      value: 0.5087333698463412
      args:
        average: macro
    - name: ROC AUC (micro)
      type: auc
      value: 0.9517119218698238
      args:
        average: micro
    - name: ROC AUC (macro)
      type: auc
      value: 0.9310155721031019
      args:
        average: macro
---

# Model Card for Multi‑Label Emotion Classification on Reddit Comments

This repository contains training and inference code for **multi‑label emotion classification** of Reddit comments using the **GoEmotions** dataset (27 emotions + neutral) with a **RoBERTa‑base** encoder. It includes a configuration‑driven training script, evaluation, decision‑threshold tuning, and a lightweight inference entrypoint.

> **Repository:** https://github.com/amirhossein-yousefi/multi-label-emotion-classification-reddit-comments

## Model Details

### Model Description

This project fine‑tunes a Transformer encoder for multi‑label emotion detection on Reddit comments. The default configuration uses **`roberta-base`**, binary cross‑entropy loss (optionally focal loss), and grid‑search threshold tuning on the validation set.

- **Developed by:** GitHub **@amirhossein-yousefi**
- **Model type:** Multi‑label text classification (Transformer encoder)
- **Language(s) (NLP):** English
- **License:** No explicit license file was found in the repository; treat as “all rights reserved” unless the author adds a license.
- **Finetuned from model:** `roberta-base`

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/multi-label-emotion-classification-reddit-comments
- **Paper [dataset]:** GoEmotions: A Dataset of Fine‑Grained Emotions (Demszky et al., 2020)

## Uses

### Direct Use

- Tagging short English texts (e.g., social posts, comments) with multiple emotions from the GoEmotions taxonomy (*joy, sadness, anger, admiration, gratitude,* and so on).
- Exploratory analytics and visualization of emotion distributions in corpora similar to Reddit.

### Downstream Use

- Fine‑tuning or domain adaptation to platforms beyond Reddit (forums, support tickets, app reviews).
- Serving as a baseline component in moderation pipelines or empathetic response systems (with careful human oversight).

### Out‑of‑Scope Use

- Medical, psychological, or diagnostic use; mental‑health inference.
- High‑stakes decisions (employment, lending, safety) without rigorous, domain‑specific validation.
- Non‑English or heavily code‑switched text without additional training/testing.

## Bias, Risks, and Limitations

- **Dataset origin:** GoEmotions is built from Reddit comments; models may inherit Reddit‑specific discourse, slang, and toxicity patterns and may underperform on other domains.
- **Annotation noise:** Third‑party analyses have raised concerns about mislabels in GoEmotions; treat labels as imperfect and consider human review for critical use cases.
- **Multi‑label uncertainty:** Threshold choice materially affects precision/recall trade‑offs. The repo tunes the threshold on validation data; you should recalibrate for your domain.

### Recommendations

- Calibrate thresholds on in‑domain validation data (the repo grid‑searches 0.05–0.95).
- Report per‑label metrics, especially for minority emotions.
- Consider bias audits and human‑in‑the‑loop review before deployment.

## How to Get Started with the Model

### Environment

- Python ≥ **3.13**
- Install dependencies:
```bash
pip install -r requirements.txt
```

### Train

The Makefile provides a default **train** target; equivalently, invoke the training module directly:

```bash
python -m emoclass.train --config configs/base.yaml
```

### Inference

After training (or pointing to a trained directory), run:

```bash
python -m emoclass.inference --model_dir outputs/goemotions_roberta --text "I love this!" "This is awful."
```
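
If you prefer calling the model from Python instead of the CLI, the following is a minimal sketch (it assumes the training run saved a standard `transformers` checkpoint to `outputs/goemotions_roberta`, and reuses the validation‑tuned threshold of 0.84 reported below):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "outputs/goemotions_roberta"  # default output directory (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

texts = ["I love this!", "This is awful."]
batch = tokenizer(texts, padding=True, truncation=True, max_length=192, return_tensors="pt")

with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits)  # one independent probability per emotion

THRESHOLD = 0.84  # validation-tuned in the example run; recalibrate for your domain
for text, row in zip(texts, probs):
    labels = [model.config.id2label[i] for i, p in enumerate(row.tolist()) if p >= THRESHOLD]
    print(text, "->", labels or ["(no emotion above threshold)"])
```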

## Training Details

### Training Data

- **Dataset:** GoEmotions (27 emotions + neutral). The default config uses the **`simplified`** variant.
- **Text column:** `text`
- **Labels column:** `labels`
- **Max sequence length:** 192

### Training Procedure

#### Preprocessing

- Standard Transformer tokenization for `roberta-base`.
- Multi‑hot label encoding for emotions (see the sketch after this list).
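
To make the multi‑hot step concrete, here is a minimal sketch (in GoEmotions `simplified`, the `labels` column holds lists of integer label indices; the helper name and `NUM_LABELS` constant are ours):

```python
import numpy as np

NUM_LABELS = 28  # 27 emotions + neutral

def to_multi_hot(label_indices: list[int], num_labels: int = NUM_LABELS) -> np.ndarray:
    """Turn a list of active label indices into a 0/1 target vector."""
    vec = np.zeros(num_labels, dtype=np.float32)
    vec[label_indices] = 1.0
    return vec

# Example: a comment annotated with admiration (index 0) and joy (index 17)
print(to_multi_hot([0, 17]))
```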

#### Training Hyperparameters

- **Base model:** `roberta-base`
- **Batch size:** 16 (train), 32 (eval)
- **Learning rate:** 2e‑5
- **Epochs:** 5
- **Weight decay:** 0.01
- **Warmup ratio:** 0.06
- **Gradient accumulation:** 1
- **Precision:** bf16/fp16 if available
- **Loss:** Binary Cross‑Entropy (optionally focal loss with γ=2.0, α=0.25)
- **Threshold tuning:** grid 0.05 → 0.95 (step 0.01); the threshold that maximized validation micro‑F1 in the example run was 0.84 (see the sketch after this list)
- **LoRA/PEFT:** available in config (default off)
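
A minimal sketch of the grid search over a single global threshold (assuming validation sigmoid outputs `probs` and multi‑hot `targets` as NumPy arrays; the repo's exact implementation may differ):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(probs: np.ndarray, targets: np.ndarray) -> tuple[float, float]:
    """Pick the decision threshold in [0.05, 0.95] that maximizes micro-F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.arange(0.05, 0.95 + 1e-9, 0.01):
        f1 = f1_score(targets, (probs >= t).astype(int), average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# best_t, best_f1 = tune_threshold(val_probs, val_targets)
```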

#### Speeds, Sizes, Times

- See `results.txt` for an example run’s timing and throughput logs.

## Evaluation

### Testing Data, Factors & Metrics

- **Test split:** GoEmotions `simplified` test.
- **Metrics:** micro/macro/sample **F1**, micro/macro **Average Precision (AP)**, micro/macro **ROC‑AUC** (a computation sketch follows this list).
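
These metrics can be reproduced with scikit-learn along the following lines (a sketch assuming test-set probabilities `probs` and multi-hot `targets`; not necessarily the repo's evaluation code):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def evaluate(probs: np.ndarray, targets: np.ndarray, threshold: float = 0.84) -> dict:
    """Thresholded F1 scores plus ranking metrics (AP, ROC-AUC) from raw probabilities."""
    preds = (probs >= threshold).astype(int)
    return {
        "f1_micro": f1_score(targets, preds, average="micro", zero_division=0),
        "f1_macro": f1_score(targets, preds, average="macro", zero_division=0),
        "f1_samples": f1_score(targets, preds, average="samples", zero_division=0),
        "ap_micro": average_precision_score(targets, probs, average="micro"),
        "ap_macro": average_precision_score(targets, probs, average="macro"),
        "roc_auc_micro": roc_auc_score(targets, probs, average="micro"),
        "roc_auc_macro": roc_auc_score(targets, probs, average="macro"),
    }
```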

### Results (example run)

- **Threshold (val‑tuned):** 0.84
- **F1 (micro):** 0.5284
- **F1 (macro):** 0.4995
- **F1 (samples):** 0.5301
- **AP (micro):** 0.5352
- **AP (macro):** 0.5087
- **ROC‑AUC (micro):** 0.9517
- **ROC‑AUC (macro):** 0.9310

*(See `results.txt` for the full log and any updates.)*

## Model Examination

- Inspect per‑label thresholds and confusion patterns; minority emotions (e.g., *grief, pride, nervousness*) often suffer lower F1 and need more tuning or class‑balancing strategies.

## Environmental Impact

- Not measured. If desired, log GPU type, hours, region, and estimate emissions using the ML CO2 calculator.

## Technical Specifications

### Model Architecture and Objective

- Transformer encoder (`roberta-base`) fine‑tuned with a sigmoid multi‑label head and BCE (or focal) loss; a focal‑loss sketch follows.
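
For reference, a minimal PyTorch sketch of the focal variant of BCE with the card's stated γ=2.0 and α=0.25 (a standard formulation, not necessarily the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                   gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Focal-weighted binary cross-entropy for multi-label logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                # model's probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```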

### Compute Infrastructure

- Frameworks: `transformers`, `datasets`, `accelerate`, `evaluate`, `scikit-learn`, optional `peft`.
- Hardware/software specifics are user‑dependent.

## Citation

**GoEmotions (dataset/paper):**
Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. (2020). *GoEmotions: A Dataset of Fine‑Grained Emotions.* ACL 2020. https://arxiv.org/abs/2005.00547

**BibTeX:**
```bibtex
@inproceedings{demszky2020goemotions,
  title={GoEmotions: A Dataset of Fine-Grained Emotions},
  author={Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```

## Glossary

- **AP:** Average Precision (area under precision–recall curve).
- **AUC:** Area under ROC curve.
- **Micro/Macro F1:** Micro aggregates over all labels; macro averages per‑label F1.

## More Information

- The configuration file at `configs/base.yaml` documents tweakable knobs (loss type, LoRA, precision, etc.).
- Artifacts are saved under `outputs/` by default.

## Model Card Authors

- Original code: @amirhossein-yousefi
- Model card: generated programmatically for documentation purposes.

## Model Card Contact

- Open an issue in the GitHub repository.