---
license: cc-by-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally identifiable information
- pii
- ner
- azerbaijan
---

# PII NER Azerbaijani v2

**PII NER Azerbaijani v2** is the second version of a fine-tuned Named Entity Recognition (NER) model (first version: <a target="_blank" href="https://huggingface.co/LocalDoc/private_ner_azerbaijani">PII NER Azerbaijani</a>) based on XLM-RoBERTa.
It is trained on Azerbaijani PII data to identify personally identifiable information such as names, dates of birth, cities, addresses, and phone numbers in text.

## Model Details

- **Base Model:** XLM-RoBERTa
- **Training Metrics:**

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       |
|-------|---------------|-----------------|-----------|----------|----------|
| 1     | 0.029100      | 0.025319        | 0.963367  | 0.962449 | 0.962907 |
| 2     | 0.019900      | 0.023291        | 0.964567  | 0.968474 | 0.966517 |
| 3     | 0.015400      | 0.018993        | 0.969536  | 0.967555 | 0.968544 |
| 4     | 0.012700      | 0.017730        | 0.971919  | 0.969768 | 0.970842 |
| 5     | 0.011100      | 0.018095        | 0.973056  | 0.970075 | 0.971563 |
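
The training script and hyperparameters behind these numbers are not published on this card. For orientation only, here is a minimal sketch of how a five-epoch token-classification fine-tune of `xlm-roberta-base` is commonly set up with the Hugging Face `Trainer`; the label list, learning rate, batch size, and dataset variables are illustrative assumptions, not the actual configuration used for this model.

```python
# Hypothetical fine-tuning sketch; dataset loading and label alignment are
# omitted, and every value below is an illustrative assumption.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Assumed label inventory (shown truncated); the real checkpoint defines its own.
labels = ["O", "B-GIVENNAME", "I-GIVENNAME", "B-SURNAME", "I-SURNAME"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="pii-ner-az-v2",      # hypothetical output path
    num_train_epochs=5,              # matches the five epochs in the table above
    learning_rate=2e-5,              # assumed value
    per_device_train_batch_size=16,  # assumed value
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=tokenized_train,      # tokenized dataset with word-aligned labels
    # eval_dataset=tokenized_validation,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
# trainer.train()
```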

- **Test Metrics:**
  - **Precision:** 0.9760
  - **Recall:** 0.9732
  - **F1 Score:** 0.9746

## Detailed Test Classification Report

| Entity            | Precision | Recall | F1-score | Support |
|--------------------|-----------|--------|----------|---------|
| AGE                | 0.98      | 0.98   | 0.98     | 509     |
| BUILDINGNUM        | 0.97      | 0.75   | 0.85     | 1285    |
| CITY               | 1.00      | 1.00   | 1.00     | 2100    |
| CREDITCARDNUMBER   | 0.99      | 0.98   | 0.99     | 249     |
| DATE               | 0.85      | 0.92   | 0.88     | 1576    |
| DRIVERLICENSENUM   | 0.98      | 0.98   | 0.98     | 258     |
| EMAIL              | 0.98      | 1.00   | 0.99     | 1485    |
| GIVENNAME          | 0.99      | 1.00   | 0.99     | 9926    |
| IDCARDNUM          | 0.99      | 0.99   | 0.99     | 1174    |
| PASSPORTNUM        | 0.99      | 0.99   | 0.99     | 426     |
| STREET             | 0.94      | 0.98   | 0.96     | 1480    |
| SURNAME            | 1.00      | 1.00   | 1.00     | 3357    |
| TAXNUM             | 0.99      | 1.00   | 0.99     | 240     |
| TELEPHONENUM       | 0.97      | 0.95   | 0.96     | 2175    |
| TIME               | 0.96      | 0.96   | 0.96     | 2216    |
| ZIPCODE            | 0.97      | 0.97   | 0.97     | 520     |

### Averages

| Metric           | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| **Micro avg**    | 0.98      | 0.97   | 0.97     | 28976   |
| **Macro avg**    | 0.97      | 0.96   | 0.97     | 28976   |
| **Weighted avg** | 0.98      | 0.97   | 0.97     | 28976   |
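
Entity-level reports in this format are typically produced with the `seqeval` library. The sketch below only illustrates that computation on tiny hand-made tag sequences (assuming BIO-style tags); it is not the evaluation script that produced the numbers above.

```python
# Illustrative only: toy tag sequences, not the model's actual predictions.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-GIVENNAME", "I-GIVENNAME", "O", "B-CITY"],
          ["B-TELEPHONENUM", "O"]]
y_pred = [["B-GIVENNAME", "I-GIVENNAME", "O", "B-CITY"],
          ["O", "O"]]

# Entity-level precision / recall / F1 per label, like the tables above.
print(classification_report(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```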

## Entities the Model Can Recognize

```python
[
    "AGE",
    "BUILDINGNUM",
    "CITY",
    "CREDITCARDNUMBER",
    "DATE",
    "DRIVERLICENSENUM",
    "EMAIL",
    "GIVENNAME",
    "IDCARDNUM",
    "PASSPORTNUM",
    "STREET",
    "SURNAME",
    "TAXNUM",
    "TELEPHONENUM",
    "TIME",
    "ZIPCODE"
]
```
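
Whatever list a card shows, the authoritative label inventory is the one stored in the checkpoint configuration. A short sketch for inspecting it (the printed values are whatever the downloaded model defines):

```python
from transformers import AutoConfig

# Read the label mapping straight from the model configuration on the Hub.
config = AutoConfig.from_pretrained("LocalDoc/private_ner_azerbaijani_v2")
print(config.id2label)
```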

## Usage

To use the model for PII extraction:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "LocalDoc/private_ner_azerbaijani_v2"

# Load the fine-tuned tokenizer and model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

test_text = (
    "Salam, mənim adım Əli Hüseynovdur. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, Nizami küçəsində, 25/31 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir."
)

# Tokenize with character offsets so predictions can be mapped back to spans
# of the original text.
inputs = tokenizer(test_text, return_tensors="pt", return_offsets_mapping=True)

# The model does not accept offset_mapping as an input, so keep it aside.
offset_mapping = inputs.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring label id for every token.
predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
offset_mapping = offset_mapping[0].tolist()
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
word_ids = inputs.word_ids(batch_index=0)

# Group sub-word tokens back into whole words, keeping the label of the first
# sub-token of each word.
aggregated = []
prev_word_id = None
for idx, word_id in enumerate(word_ids):
    if word_id is None:  # special tokens such as <s> and </s>
        continue
    if word_id != prev_word_id:
        aggregated.append({
            "word_id": word_id,
            "tokens": [tokens[idx]],
            "offsets": [offset_mapping[idx]],
            "label": predicted_labels[idx]
        })
    else:
        aggregated[-1]["tokens"].append(tokens[idx])
        aggregated[-1]["offsets"].append(offset_mapping[idx])
    prev_word_id = word_id

# Merge consecutive words that share the same label into entity spans.
entities = []
current_entity = None
for word in aggregated:
    if word["label"] == "O":
        if current_entity is not None:
            entities.append(current_entity)
            current_entity = None
    else:
        if current_entity is None:
            current_entity = {
                "type": word["label"],
                "start": word["offsets"][0][0],
                "end": word["offsets"][-1][1]
            }
        else:
            if word["label"] == current_entity["type"]:
                current_entity["end"] = word["offsets"][-1][1]
            else:
                entities.append(current_entity)
                current_entity = {
                    "type": word["label"],
                    "start": word["offsets"][0][0],
                    "end": word["offsets"][-1][1]
                }
if current_entity is not None:
    entities.append(current_entity)

# Attach the covered text to every detected span and print the results.
for entity in entities:
    entity["text"] = test_text[entity["start"]:entity["end"]]

for entity in entities:
    print(entity)
```

Example output:

```python
{'type': 'FIRSTNAME', 'start': 18, 'end': 21, 'text': 'Əli'}
{'type': 'LASTNAME', 'start': 22, 'end': 34, 'text': 'Hüseynovdur.'}
{'type': 'DOB', 'start': 49, 'end': 64, 'text': '15.05.1990-dır.'}
{'type': 'STREET', 'start': 81, 'end': 87, 'text': 'Nizami'}
{'type': 'BUILDINGNUMBER', 'start': 99, 'end': 104, 'text': '25/31'}
{'type': 'PHONENUMBER', 'start': 141, 'end': 159, 'text': '+994552345678-dir.'}
```
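
Because the model targets PII, a common next step is masking the detected spans. The following sketch reuses the `entities` and `test_text` variables from the example above; the placeholder format is an arbitrary choice.

```python
# Replace each detected span with a placeholder, working right-to-left so that
# earlier character offsets stay valid while the string is edited.
redacted = test_text
for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
    redacted = redacted[:entity["start"]] + f"[{entity['type']}]" + redacted[entity["end"]:]
print(redacted)
```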


## What the CC BY 4.0 License Allows

The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:

### ✅ You Can:
- **Use** the model for any purpose, including commercial use.
- **Share** it: copy and redistribute it in any medium or format.
- **Adapt** it: remix, transform, and build upon it for any purpose, even commercially.

### 📝 You Must:
- **Give appropriate credit**: attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
- **Not imply endorsement**: do not suggest the original author endorses you or your use.

### ❌ You Cannot:
- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).


### Summary:
You are free to use, modify, and distribute the model, even for commercial purposes, as long as you give proper credit to the original creator.


For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY 4.0 license</a>.


## Contact

For more information, questions, or issues, please contact LocalDoc at [[email protected]].