---
license: cc-by-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally identifiable information
- pii
- ner
- azerbaijan
datasets:
- LocalDoc/pii_ner_azerbaijani
---

# PII NER Azerbaijani v2

**PII NER Azerbaijani** is the second version of a fine-tuned Named Entity Recognition (NER) model (first version: PII NER Azerbaijani) based on XLM-RoBERTa. It is trained on Azerbaijani PII data to classify personally identifiable information in text, such as names, dates of birth, cities, addresses, and phone numbers.

## Model Details

- **Base Model:** XLM-RoBERTa
- **Training Metrics:**

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       |
|-------|---------------|-----------------|-----------|----------|----------|
| 1     | 0.029100      | 0.025319        | 0.963367  | 0.962449 | 0.962907 |
| 2     | 0.019900      | 0.023291        | 0.964567  | 0.968474 | 0.966517 |
| 3     | 0.015400      | 0.018993        | 0.969536  | 0.967555 | 0.968544 |
| 4     | 0.012700      | 0.017730        | 0.971919  | 0.969768 | 0.970842 |
| 5     | 0.011100      | 0.018095        | 0.973056  | 0.970075 | 0.971563 |

- **Test Metrics:**
  - **Precision:** 0.9760
  - **Recall:** 0.9732
  - **F1 Score:** 0.9746

## Detailed Test Classification Report

| Entity              | Precision | Recall | F1-score | Support |
|---------------------|-----------|--------|----------|---------|
| AGE                 | 0.98      | 0.98   | 0.98     | 509     |
| BUILDINGNUM         | 0.97      | 0.75   | 0.85     | 1285    |
| CITY                | 1.00      | 1.00   | 1.00     | 2100    |
| CREDITCARDNUMBER    | 0.99      | 0.98   | 0.99     | 249     |
| DATE                | 0.85      | 0.92   | 0.88     | 1576    |
| DRIVERLICENSENUM    | 0.98      | 0.98   | 0.98     | 258     |
| EMAIL               | 0.98      | 1.00   | 0.99     | 1485    |
| GIVENNAME           | 0.99      | 1.00   | 0.99     | 9926    |
| IDCARDNUM           | 0.99      | 0.99   | 0.99     | 1174    |
| PASSPORTNUM         | 0.99      | 0.99   | 0.99     | 426     |
| STREET              | 0.94      | 0.98   | 0.96     | 1480    |
| SURNAME             | 1.00      | 1.00   | 1.00     | 3357    |
| TAXNUM              | 0.99      | 1.00   | 0.99     | 240     |
| TELEPHONENUM        | 0.97      | 0.95   | 0.96     | 2175    |
| TIME                | 0.96      | 0.96   | 0.96     | 2216    |
| ZIPCODE             | 0.97      | 0.97   | 0.97     | 520     |
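### Checking the Reported Scores

As a quick sanity check on the tables above, the sketch below recomputes the overall F1 from the reported test precision and recall, and derives the macro and support-weighted precision from the per-entity rows. All numbers are copied from this card. The per-entity values are already rounded to two decimals, so small differences from the reported averages are possible. The variable names are illustrative and not part of the model's code.

```python
# Overall test precision and recall, as reported under "Test Metrics".
precision, recall = 0.9760, 0.9732

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# (precision, support) per entity, copied from the classification report.
report = {
    "AGE": (0.98, 509),
    "BUILDINGNUM": (0.97, 1285),
    "CITY": (1.00, 2100),
    "CREDITCARDNUMBER": (0.99, 249),
    "DATE": (0.85, 1576),
    "DRIVERLICENSENUM": (0.98, 258),
    "EMAIL": (0.98, 1485),
    "GIVENNAME": (0.99, 9926),
    "IDCARDNUM": (0.99, 1174),
    "PASSPORTNUM": (0.99, 426),
    "STREET": (0.94, 1480),
    "SURNAME": (1.00, 3357),
    "TAXNUM": (0.99, 240),
    "TELEPHONENUM": (0.97, 2175),
    "TIME": (0.96, 2216),
    "ZIPCODE": (0.97, 520),
}

total_support = sum(s for _, s in report.values())

# Macro average: every entity type counts equally.
macro_precision = sum(p for p, _ in report.values()) / len(report)

# Weighted average: every entity type counts in proportion to its support.
weighted_precision = sum(p * s for p, s in report.values()) / total_support

print(f"F1: {f1:.4f}")                                  # F1: 0.9746
print(f"Total support: {total_support}")                # Total support: 28976
print(f"Macro precision: {macro_precision:.2f}")        # Macro precision: 0.97
print(f"Weighted precision: {weighted_precision:.2f}")  # Weighted precision: 0.98
```

Macro averaging gives each entity type the same weight, so the weaker DATE row pulls the macro precision down more than the weighted precision, which is dominated by the high-support GIVENNAME class.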
### Averages

| Metric           | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| **Micro avg**    | 0.98      | 0.97   | 0.97     | 28976   |
| **Macro avg**    | 0.97      | 0.96   | 0.97     | 28976   |
| **Weighted avg** | 0.98      | 0.97   | 0.97     | 28976   |

## Entities the Model Can Recognize

```python
[
    "AGE",
    "BUILDINGNUM",
    "CITY",
    "CREDITCARDNUMBER",
    "DATE",
    "DRIVERLICENSENUM",
    "EMAIL",
    "GIVENNAME",
    "IDCARDNUM",
    "PASSPORTNUM",
    "STREET",
    "SURNAME",
    "TAXNUM",
    "TELEPHONENUM",
    "TIME",
    "ZIPCODE"
]
```

## Usage

The model is trained to work with lowercase text. The code below lowercases its input automatically; if you write your own inference code, apply the same normalization.

To use the model for PII recognition and anonymization:

```python
import torch
from transformers import AutoModelForTokenClassification, XLMRobertaTokenizerFast
from typing import List, Dict, Tuple


class AzerbaijaniNER:
    def __init__(self, model_name_or_path="LocalDoc/private_ner_azerbaijani_v2"):
        self.model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
        self.model.eval()

        # Run on GPU when one is available.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

        # Class index -> BIO tag ("B-" begins an entity, "I-" continues it).
        self.id_to_label = {
            0: "O",
            1: "B-AGE", 2: "B-BUILDINGNUM", 3: "B-CITY", 4: "B-CREDITCARDNUMBER",
            5: "B-DATE", 6: "B-DRIVERLICENSENUM", 7: "B-EMAIL", 8: "B-GIVENNAME",
            9: "B-IDCARDNUM", 10: "B-PASSPORTNUM", 11: "B-STREET", 12: "B-SURNAME",
            13: "B-TAXNUM", 14: "B-TELEPHONENUM", 15: "B-TIME", 16: "B-ZIPCODE",
            17: "I-AGE", 18: "I-BUILDINGNUM", 19: "I-CITY", 20: "I-CREDITCARDNUMBER",
            21: "I-DATE", 22: "I-DRIVERLICENSENUM", 23: "I-EMAIL", 24: "I-GIVENNAME",
            25: "I-IDCARDNUM", 26: "I-PASSPORTNUM", 27: "I-STREET", 28: "I-SURNAME",
            29: "I-TAXNUM", 30: "I-TELEPHONENUM", 31: "I-TIME", 32: "I-ZIPCODE"
        }

        # Entity label -> human-readable name used in the output.
        self.entity_types = {
            "AGE": "Age",
            "BUILDINGNUM": "Building Number",
            "CITY": "City",
            "CREDITCARDNUMBER": "Credit Card Number",
"DATE": "Date", "DRIVERLICENSENUM": "Driver License Number", "EMAIL": "Email", "GIVENNAME": "Given Name", "IDCARDNUM": "ID Card Number", "PASSPORTNUM": "Passport Number", "STREET": "Street", "SURNAME": "Surname", "TAXNUM": "Tax ID Number", "TELEPHONENUM": "Phone Number", "TIME": "Time", "ZIPCODE": "Zip Code" } def predict(self, text: str, max_length: int = 512) -> List[Dict]: text = text.lower() inputs = self.tokenizer( text, return_tensors="pt", max_length=max_length, padding="max_length", truncation=True, return_offsets_mapping=True ) offset_mapping = inputs.pop("offset_mapping").numpy()[0] inputs = {k: v.to(self.device) for k, v in inputs.items()} with torch.no_grad(): outputs = self.model(**inputs) predictions = outputs.logits.argmax(dim=2) predictions = predictions[0].cpu().numpy() entities = [] current_entity = None for idx, (offset, pred_id) in enumerate(zip(offset_mapping, predictions)): if offset[0] == 0 and offset[1] == 0: continue pred_label = self.id_to_label[pred_id] if pred_label.startswith("B-"): if current_entity: entities.append(current_entity) entity_type = pred_label[2:] current_entity = { "label": entity_type, "name": self.entity_types.get(entity_type, entity_type), "start": int(offset[0]), "end": int(offset[1]), "value": text[offset[0]:offset[1]] } elif pred_label.startswith("I-") and current_entity is not None: entity_type = pred_label[2:] if entity_type == current_entity["label"]: current_entity["end"] = int(offset[1]) current_entity["value"] = text[current_entity["start"]:current_entity["end"]] else: entities.append(current_entity) current_entity = None elif pred_label == "O" and current_entity is not None: entities.append(current_entity) current_entity = None if current_entity: entities.append(current_entity) return entities def anonymize_text(self, text: str, replacement_char: str = "X") -> Tuple[str, List[Dict]]: entities = self.predict(text) if not entities: return text, [] entities.sort(key=lambda x: x["start"], reverse=True) 
        anonymized_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            length = end - start
            # Mask the span with a run of the same length.
            anonymized_text = anonymized_text[:start] + replacement_char * length + anonymized_text[end:]

        # Return the entities in reading order.
        entities.sort(key=lambda x: x["start"])
        return anonymized_text, entities

    def highlight_entities(self, text: str) -> str:
        entities = self.predict(text)

        if not entities:
            return text

        # Insert markers from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)

        highlighted_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            entity_value = entity["value"]
            entity_type = entity["name"]
            highlighted_text = (
                highlighted_text[:start]
                + f"[{entity_type}: {entity_value}]"
                + highlighted_text[end:]
            )

        return highlighted_text


if __name__ == "__main__":
    ner = AzerbaijaniNER()

    test_text = """Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?"""

    print("=== Original Text ===")
    print(test_text)

    print("\n=== Found Entities ===")
    entities = ner.predict(test_text)
    for entity in entities:
        print(f"{entity['name']}: {entity['value']} (positions {entity['start']}-{entity['end']})")

    print("\n=== Text with Highlighted Entities ===")
    highlighted_text = ner.highlight_entities(test_text)
    print(highlighted_text)

    print("\n=== Anonymized Text ===")
    anonymized_text, _ = ner.anonymize_text(test_text)
    print(anonymized_text)
```

```
=== Original Text ===
Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Found Entities ===
Given Name: əli (positions 18-21)
Surname: hüseynov (positions 22-30)
Date: 15.05.1990 (positions 48-58)
City: bakı (positions 64-68)
Street: 28 may küçəsi (positions 80-93)
Building Number: 4 (positions 94-95)
Phone Number: +994552345678 (positions 132-145)
Credit Card Number: 4169741358254152 (positions 155-171)

=== Text with Highlighted Entities ===
Salam, mənim adım [Given Name: əli] [Surname: hüseynov]du. Doğum tarixim [Date: 15.05.1990]-dır. [City: bakı] şəhərində, [Street: 28 may küçəsi] [Building Number: 4] ünvanında yaşayıram. Telefon nömrəm [Phone Number: +994552345678]-dir. Mən [Credit Card Number: 4169741358254152] nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Anonymized Text ===
Salam, mənim adım XXX XXXXXXXXdu. Doğum tarixim XXXXXXXXXX-dır. XXXX şəhərində, XXXXXXXXXXXXX X ünvanında yaşayıram. Telefon nömrəm XXXXXXXXXXXXX-dir. Mən XXXXXXXXXXXXXXXX nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
```

## CC BY 4.0 License: What It Allows

The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:

### ✅ You Can:
- **Use** the model for any purpose, including commercial use.
- **Share** it: copy and redistribute it in any medium or format.
- **Adapt** it: remix, transform, and build upon it for any purpose, even commercially.

### 📝 You Must:
- **Give appropriate credit**: attribute the original creator (for example, by name), link to the license, and indicate if changes were made.
- **Not imply endorsement**: do not suggest the original author endorses you or your use.

### ❌ You Cannot:
- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).

### Summary:
You are free to use, modify, and distribute the model, even for commercial purposes, as long as you give proper credit to the original creator.

For more information, please refer to the CC BY 4.0 license.

## Contact

For more information, questions, or issues, please contact LocalDoc at v.resad.89@gmail.com.