---
license: cc-by-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally identifiable information
- pii
- ner
- azerbaijan
datasets:
- LocalDoc/pii_ner_azerbaijani
---


# PII NER Azerbaijani v2

**PII NER Azerbaijani v2** is the second version of a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa (first version: <a target="_blank" href="https://huggingface.co/LocalDoc/private_ner_azerbaijani">PII NER Azerbaijani</a>).
It is trained on Azerbaijani PII data to detect and classify personally identifiable information in text, such as names, dates of birth, cities, addresses, and phone numbers.

## Model Details

- **Base Model:** XLM-RoBERTa
- **Training Metrics:**

| Epoch | Training Loss | Validation Loss | Precision | Recall  | F1       |
|-------|----------------|------------------|-----------|---------|----------|
| 1     | 0.029100       | 0.025319         | 0.963367  | 0.962449| 0.962907 |
| 2     | 0.019900       | 0.023291         | 0.964567  | 0.968474| 0.966517 |
| 3     | 0.015400       | 0.018993         | 0.969536  | 0.967555| 0.968544 |
| 4     | 0.012700       | 0.017730         | 0.971919  | 0.969768| 0.970842 |
| 5     | 0.011100       | 0.018095         | 0.973056  | 0.970075| 0.971563 |
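
The training table shows validation loss reaching its minimum at epoch 4, while F1 peaks at epoch 5. The short sketch below copies those two columns and picks the best epoch under each criterion. Which checkpoint was actually published is not stated in this card, so the comparison is illustrative only.

```python
# Validation loss and F1 per epoch, copied from the training table above.
history = {
    1: {"val_loss": 0.025319, "f1": 0.962907},
    2: {"val_loss": 0.023291, "f1": 0.966517},
    3: {"val_loss": 0.018993, "f1": 0.968544},
    4: {"val_loss": 0.017730, "f1": 0.970842},
    5: {"val_loss": 0.018095, "f1": 0.971563},
}

# Two common checkpoint-selection criteria can disagree.
best_by_loss = min(history, key=lambda epoch: history[epoch]["val_loss"])
best_by_f1 = max(history, key=lambda epoch: history[epoch]["f1"])

print(f"Lowest validation loss: epoch {best_by_loss}")
print(f"Highest F1: epoch {best_by_f1}")
```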



- **Test Metrics:**
  - **Precision:** 0.9760
  - **Recall:** 0.9732
  - **F1 Score:** 0.9746
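
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported test F1 can be recomputed from the other two figures. The snippet is a minimal sketch using only the numbers listed above.

```python
# Test precision and recall as reported above.
precision = 0.9760
recall = 0.9732

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9746, matching the reported F1 score
```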


## Detailed Test Classification Report

| Entity              | Precision | Recall | F1-score | Support |
|---------------------|-----------|--------|----------|---------|
| AGE                 | 0.98      | 0.98   | 0.98     | 509     |
| BUILDINGNUM         | 0.97      | 0.75   | 0.85     | 1285    |
| CITY                | 1.00      | 1.00   | 1.00     | 2100    |
| CREDITCARDNUMBER    | 0.99      | 0.98   | 0.99     | 249     |
| DATE                | 0.85      | 0.92   | 0.88     | 1576    |
| DRIVERLICENSENUM    | 0.98      | 0.98   | 0.98     | 258     |
| EMAIL               | 0.98      | 1.00   | 0.99     | 1485    |
| GIVENNAME           | 0.99      | 1.00   | 0.99     | 9926    |
| IDCARDNUM           | 0.99      | 0.99   | 0.99     | 1174    |
| PASSPORTNUM         | 0.99      | 0.99   | 0.99     | 426     |
| STREET              | 0.94      | 0.98   | 0.96     | 1480    |
| SURNAME             | 1.00      | 1.00   | 1.00     | 3357    |
| TAXNUM              | 0.99      | 1.00   | 0.99     | 240     |
| TELEPHONENUM        | 0.97      | 0.95   | 0.96     | 2175    |
| TIME                | 0.96      | 0.96   | 0.96     | 2216    |
| ZIPCODE             | 0.97      | 0.97   | 0.97     | 520     |


### Averages

| Metric        | Precision | Recall | F1-score | Support |
|---------------|-----------|--------|----------|---------|
| **Micro avg** | 0.98      | 0.97   | 0.97     | 28976   |
| **Macro avg** | 0.97      | 0.96   | 0.97     | 28976   |
| **Weighted avg** | 0.98   | 0.97   | 0.97     | 28976   |
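
The averages differ in how they treat entity types. The macro average is the plain mean over the 16 types, while the weighted average weights each type by its support, so the low BUILDINGNUM recall (0.75) pulls the macro figure down more than the weighted one. The sketch below recomputes both for recall from the classification report. Because the per-entity values in the table are rounded to two decimals, the results only agree with the reported averages to about two decimals.

```python
# (recall, support) per entity type, copied from the classification report.
report = {
    "AGE": (0.98, 509),
    "BUILDINGNUM": (0.75, 1285),
    "CITY": (1.00, 2100),
    "CREDITCARDNUMBER": (0.98, 249),
    "DATE": (0.92, 1576),
    "DRIVERLICENSENUM": (0.98, 258),
    "EMAIL": (1.00, 1485),
    "GIVENNAME": (1.00, 9926),
    "IDCARDNUM": (0.99, 1174),
    "PASSPORTNUM": (0.99, 426),
    "STREET": (0.98, 1480),
    "SURNAME": (1.00, 3357),
    "TAXNUM": (1.00, 240),
    "TELEPHONENUM": (0.95, 2175),
    "TIME": (0.96, 2216),
    "ZIPCODE": (0.97, 520),
}

total_support = sum(support for _, support in report.values())

# Macro average: every entity type counts equally.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each entity type counts in proportion to its support.
weighted_recall = (
    sum(recall * support for recall, support in report.values()) / total_support
)

print(f"total support:   {total_support}")
print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```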


## Entities the Model Can Recognize

```python
[
    "AGE",
    "BUILDINGNUM",
    "CITY",
    "CREDITCARDNUMBER",
    "DATE",
    "DRIVERLICENSENUM",
    "EMAIL",
    "GIVENNAME",
    "IDCARDNUM",
    "PASSPORTNUM",
    "STREET",
    "SURNAME",
    "TAXNUM",
    "TELEPHONENUM",
    "TIME",
    "ZIPCODE"
]

```
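
The model predicts these entity types as BIO tags: one `O` class plus a `B-` (beginning) and an `I-` (inside) class per type, for 33 classes in total. The sketch below rebuilds the id layout used by the `id_to_label` dictionary in the Usage section (id 0 for `O`, ids 1 to 16 for `B-` tags, ids 17 to 32 for `I-` tags). It assumes the ordering shown there and is only a convenience, not a replacement for the mapping stored with the model.

```python
entity_labels = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER", "DATE",
    "DRIVERLICENSENUM", "EMAIL", "GIVENNAME", "IDCARDNUM", "PASSPORTNUM",
    "STREET", "SURNAME", "TAXNUM", "TELEPHONENUM", "TIME", "ZIPCODE",
]

# id 0 is the "outside" tag; B- tags come first, then I- tags in the same order.
id_to_label = {0: "O"}
for i, name in enumerate(entity_labels, start=1):
    id_to_label[i] = f"B-{name}"
    id_to_label[i + len(entity_labels)] = f"I-{name}"

print(len(id_to_label))
print(id_to_label[1], id_to_label[17])
```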

## Usage

To use the model for PII entity recognition:

The model was trained on lowercase text. The code below lowercases the input automatically; if you write your own inference code, lowercase the input first.

```python
import torch
from transformers import AutoModelForTokenClassification, XLMRobertaTokenizerFast
from typing import List, Dict, Tuple

class AzerbaijaniNER:
    def __init__(self, model_name_or_path="LocalDoc/private_ner_azerbaijani_v2"):
        self.model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
        
        self.model.eval()
        
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        
        self.id_to_label = {
            0: "O",
            1: "B-AGE", 2: "B-BUILDINGNUM", 3: "B-CITY", 4: "B-CREDITCARDNUMBER",
            5: "B-DATE", 6: "B-DRIVERLICENSENUM", 7: "B-EMAIL", 8: "B-GIVENNAME",
            9: "B-IDCARDNUM", 10: "B-PASSPORTNUM", 11: "B-STREET", 12: "B-SURNAME",
            13: "B-TAXNUM", 14: "B-TELEPHONENUM", 15: "B-TIME", 16: "B-ZIPCODE",
            17: "I-AGE", 18: "I-BUILDINGNUM", 19: "I-CITY", 20: "I-CREDITCARDNUMBER",
            21: "I-DATE", 22: "I-DRIVERLICENSENUM", 23: "I-EMAIL", 24: "I-GIVENNAME", 
            25: "I-IDCARDNUM", 26: "I-PASSPORTNUM", 27: "I-STREET", 28: "I-SURNAME",
            29: "I-TAXNUM", 30: "I-TELEPHONENUM", 31: "I-TIME", 32: "I-ZIPCODE"
        }
        
        self.entity_types = {
            "AGE": "Age",
            "BUILDINGNUM": "Building Number",
            "CITY": "City",
            "CREDITCARDNUMBER": "Credit Card Number",
            "DATE": "Date",
            "DRIVERLICENSENUM": "Driver License Number",
            "EMAIL": "Email",
            "GIVENNAME": "Given Name",
            "IDCARDNUM": "ID Card Number",
            "PASSPORTNUM": "Passport Number",
            "STREET": "Street",
            "SURNAME": "Surname",
            "TAXNUM": "Tax ID Number",
            "TELEPHONENUM": "Phone Number",
            "TIME": "Time",
            "ZIPCODE": "Zip Code"
        }
    
    def predict(self, text: str, max_length: int = 512) -> List[Dict]:
        text = text.lower()
        
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_offsets_mapping=True
        )
        
        offset_mapping = inputs.pop("offset_mapping").numpy()[0]
        
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = outputs.logits.argmax(dim=2)
        
        predictions = predictions[0].cpu().numpy()
        
        entities = []
        current_entity = None
        
        for offset, pred_id in zip(offset_mapping, predictions):
            # Special tokens and padding have a (0, 0) offset; skip them.
            if offset[0] == 0 and offset[1] == 0:
                continue
                
            pred_label = self.id_to_label[pred_id]
            
            if pred_label.startswith("B-"):
                if current_entity:
                    entities.append(current_entity)
                
                entity_type = pred_label[2:]
                current_entity = {
                    "label": entity_type,
                    "name": self.entity_types.get(entity_type, entity_type),
                    "start": int(offset[0]),
                    "end": int(offset[1]),
                    "value": text[offset[0]:offset[1]]
                }
            
            elif pred_label.startswith("I-") and current_entity is not None:
                entity_type = pred_label[2:]
                
                if entity_type == current_entity["label"]:
                    current_entity["end"] = int(offset[1])
                    current_entity["value"] = text[current_entity["start"]:current_entity["end"]]
                else:
                    entities.append(current_entity)
                    current_entity = None
            
            elif pred_label == "O" and current_entity is not None:
                entities.append(current_entity)
                current_entity = None
        
        if current_entity:
            entities.append(current_entity)
        
        return entities
    
    def anonymize_text(self, text: str, replacement_char: str = "X") -> Tuple[str, List[Dict]]:
        entities = self.predict(text)
        
        if not entities:
            return text, []
        
        # Replace from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)
        
        anonymized_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            length = end - start
            anonymized_text = anonymized_text[:start] + replacement_char * length + anonymized_text[end:]
        
        entities.sort(key=lambda x: x["start"])
        
        return anonymized_text, entities

    def highlight_entities(self, text: str) -> str:
        entities = self.predict(text)
        
        if not entities:
            return text
        
        # Insert markers from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)
        
        highlighted_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            entity_value = entity["value"]
            entity_type = entity["name"]
            
            highlighted_text = (
                highlighted_text[:start] + 
                f"[{entity_type}: {entity_value}]" + 
                highlighted_text[end:]
            )
        
        return highlighted_text

if __name__ == "__main__":
    ner = AzerbaijaniNER()
    
    test_text = """Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?"""
    
    print("=== Original Text ===")
    print(test_text)
    print("\n=== Found Entities ===")
    
    entities = ner.predict(test_text)
    for entity in entities:
        print(f"{entity['name']}: {entity['value']} (positions {entity['start']}-{entity['end']})")
    
    print("\n=== Text with Highlighted Entities ===")
    highlighted_text = ner.highlight_entities(test_text)
    print(highlighted_text)
    
    print("\n=== Anonymized Text ===")
    anonymized_text, _ = ner.anonymize_text(test_text)
    print(anonymized_text)
```

```
=== Original Text ===
Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Found Entities ===
Given Name: əli (positions 18-21)
Surname: hüseynov (positions 22-30)
Date: 15.05.1990 (positions 48-58)
City: bakı (positions 64-68)
Street: 28 may küçəsi (positions 80-93)
Building Number: 4 (positions 94-95)
Phone Number: +994552345678 (positions 132-145)
Credit Card Number: 4169741358254152 (positions 155-171)

=== Text with Highlighted Entities ===
Salam, mənim adım [Given Name: əli] [Surname: hüseynov]du. Doğum tarixim [Date: 15.05.1990]-dır. [City: bakı] şəhərində, [Street: 28 may küçəsi] [Building Number: 4] ünvanında yaşayıram. Telefon nömrəm [Phone Number: +994552345678]-dir. Mən [Credit Card Number: 4169741358254152] nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Anonymized Text ===
Salam, mənim adım XXX XXXXXXXXdu. Doğum tarixim XXXXXXXXXX-dır. XXXX şəhərində, XXXXXXXXXXXXX X ünvanında yaşayıram. Telefon nömrəm XXXXXXXXXXXXX-dir. Mən XXXXXXXXXXXXXXXX nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
```
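
One caveat when applying the returned offsets to the original text: the offsets are computed on the lowercased string. Python's `str.lower()` turns `İ` (U+0130) into two code points, so for text containing that letter the positions may no longer line up with the original. The sketch below shows a hypothetical guard, `lowercasing_preserves_offsets`, together with a hypothetical `redact_with_labels` helper that replaces each span with its label instead of `X` characters. Both helpers are illustrative and not part of the model's code; the entity dictionaries are assumed to have the shape returned by `predict` above.

```python
from typing import Dict, List


def lowercasing_preserves_offsets(text: str) -> bool:
    """Return True if str.lower() keeps every character at the same index."""
    return all(len(ch.lower()) == 1 for ch in text)


def redact_with_labels(text: str, entities: List[Dict]) -> str:
    """Replace each entity span with a [LABEL] placeholder (hypothetical helper)."""
    if not lowercasing_preserves_offsets(text):
        raise ValueError("Offsets from the lowercased text may not match the original.")
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['label']}]" + text[ent["end"]:]
    return text


# Offsets taken from the example output above.
sample = "Salam, mənim adım Əli Hüseynovdu."
entities = [
    {"label": "GIVENNAME", "start": 18, "end": 21},
    {"label": "SURNAME", "start": 22, "end": 30},
]

print(redact_with_labels(sample, entities))
# Salam, mənim adım [GIVENNAME] [SURNAME]du.

print(lowercasing_preserves_offsets("İlham"))
# False
```

When the guard returns `False`, one option is to report entity values from the lowercased text only and skip in-place replacement on the original.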


## CC BY 4.0 License — What It Allows

The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:

### ✅ You Can:
- **Use** the model for any purpose, including commercial use.
- **Share** it — copy and redistribute in any medium or format.
- **Adapt** it — remix, transform, and build upon it for any purpose, even commercially.

### 📝 You Must:
- **Give appropriate credit** — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
- **Not imply endorsement** — Do not suggest the original author endorses you or your use.

### ❌ You Cannot:
- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).


### Summary:
You are free to use, modify, and distribute the model — even for commercial purposes — as long as you give proper credit to the original creator.


For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY 4.0 license</a>.


## Contact

For more information, questions, or issues, please contact LocalDoc at [[email protected]].