grammarly/pseudonymization-seq2seq

oleksandryermilov committed on Aug 9, 2023

Commit fc81f44 · 1 Parent(s): 467887d

Update README.md

Files changed (1)

README.md +97 -0

@@ -1,3 +1,100 @@
 ---
 license: apache-2.0
+datasets:
+- grammarly/pseudonymization-data
+language:
+- en
+metrics:
+- f1
+- bleu
+pipeline_tag: text2text-generation
 ---
+# Model Card for pseudonymization-seq2seq
+<!-- Provide a quick summary of what the model is/does. -->
+This repository contains the files for the two Transformer-based Seq2Seq models used in our paper: https://arxiv.org/abs/2306.05561.
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** Oleksandr Yermilov, Vipul Raheja, Artem Chernodub
+- **Model type:** Seq2Seq
+- **Language (NLP):** English
+- **License:** Apache License 2.0
+- **Finetuned from model:** BART
+### Model Sources
+- **Paper:** https://arxiv.org/abs/2306.05561
+## Uses
+These models can be used to anonymize English-language datasets by replacing named entities with other named entities (pseudonymization).
+## Bias, Risks, and Limitations
+Please check the Limitations section in our paper.
+## Training Details
+### Training Data
+https://huggingface.co/datasets/grammarly/pseudonymization-data/tree/main/seq2seq
+### Training Procedure
+1. Gather text data from Wikipedia.
+2. Preprocess it using NER-based pseudonymization.
+3. Fine-tune a BART model on a translation task: translating text from "original" to "pseudonymized".
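
As a rough illustration of step 2 and of the "original" to "pseudonymized" pairs used in step 3, the sketch below replaces entity spans with same-type substitutes. It is not the pipeline from the paper: the `REPLACEMENTS` pools, the `pseudonymize` and `make_training_pair` helpers, and the span format are all hypothetical, and the spans are assumed to come from an external NER system.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical replacement pools, keyed by entity type (an assumption for
# illustration; a real pipeline would draw from a much larger inventory).
REPLACEMENTS: Dict[str, List[str]] = {
    "PERSON": ["Alice Brown", "John Smith"],
    "LOC": ["Paris", "Toronto"],
    "ORG": ["Acme Corp", "Globex"],
}

# (start, end, entity_type) character offsets, as an NER system might report them.
Span = Tuple[int, int, str]


def pseudonymize(text: str, spans: List[Span], rng: random.Random) -> str:
    """Replace each entity span with a different entity of the same type."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # copy the text before the entity
        original = text[start:end]
        candidates = [c for c in REPLACEMENTS.get(label, []) if c != original]
        # Keep the original text if no substitute is known for this type.
        pieces.append(rng.choice(candidates) if candidates else original)
        cursor = end
    pieces.append(text[cursor:])  # copy the remaining text
    return "".join(pieces)


def make_training_pair(text: str, spans: List[Span], seed: int = 0) -> Dict[str, str]:
    """Build one source/target pair for the translation-style fine-tuning task."""
    rng = random.Random(seed)
    return {"original": text, "pseudonymized": pseudonymize(text, spans, rng)}
```

Each resulting pair maps an original sentence to its pseudonymized counterpart, which is the form of supervision a Seq2Seq model needs for this task.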
+#### Training Hyperparameters
+We train the models for 3 epochs using the `AdamW` optimizer with a learning rate of α = 2·10<sup>-5</sup> and a batch size of 8.
+## Evaluation
+### Factors & Metrics
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+There is no ground truth for named entities in the data on which these models were trained. We therefore decide whether a word is a named entity using one of two NER systems (spaCy or FLAIR).
+#### Metrics
+We measure how much of the text is changed by our model. Specifically, we classify the translated text word by word into the following categories:
+1. True positive (TP) - A named entity that was changed to another named entity.
+2. True negative (TN) - Not a named entity, and not changed.
+3. False positive (FP) - Not a named entity, but changed to another word.
+4. False negative (FN) - A named entity that was not changed to another named entity.
+We calculate the F<sub>1</sub> score from these counts.
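
The four categories and the resulting score can be sketched as follows. This is a minimal illustration and not the evaluation code from the paper: the helper names `count_categories` and `f1_score` are hypothetical, the source and output are assumed to be aligned word for word, and the per-word entity flags are assumed to come from an NER system such as spaCy or FLAIR.

```python
from typing import Dict, List


def count_categories(
    src_words: List[str],
    out_words: List[str],
    src_is_entity: List[bool],
    out_is_entity: List[bool],
) -> Dict[str, int]:
    """Count TP/TN/FP/FN word by word (assumes a one-to-one word alignment)."""
    if not (len(src_words) == len(out_words) == len(src_is_entity) == len(out_is_entity)):
        raise ValueError("all inputs must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for src, out, src_ent, out_ent in zip(src_words, out_words, src_is_entity, out_is_entity):
        changed = src != out
        if src_ent:
            # TP only if the entity was replaced by another named entity.
            counts["TP" if changed and out_ent else "FN"] += 1
        else:
            # A non-entity word should stay untouched.
            counts["FP" if changed else "TN"] += 1
    return counts


def f1_score(counts: Dict[str, int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denominator = 2 * counts["TP"] + counts["FP"] + counts["FN"]
    return 2 * counts["TP"] / denominator if denominator else 0.0
```

Note that true negatives do not enter the F<sub>1</sub> formula, so the score is not inflated by the large number of unchanged non-entity words.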
+## Citation
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+```
+@misc{yermilov2023privacy,
+      title={Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization},
+      author={Oleksandr Yermilov and Vipul Raheja and Artem Chernodub},
+      year={2023},
+      eprint={2306.05561},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
+## Model Card Contact
+Oleksandr Yermilov ([email protected]).