---
license: apache-2.0
language:
- en
- fr
- de
- it
- es
- pt
- pl
- zh
- nl
base_model:
- FacebookAI/xlm-roberta-base
datasets:
- PleIAs/ToxicCommons
- yangezheng/SWSR-SexComment
tags:
- toxicity
- data
---

# Multilingual Toxicity Classifiers used in Apertus Pretraining

#### Author: Olivia Simin Fan (@Olivia-umich)

Language-specific toxicity classifiers for English, French, German, Italian, Spanish, Portuguese, Polish, Chinese, and Dutch, trained on the [PleIAs/ToxicCommons](https://github.com/Pleias/toxic-commons) and [SWSR-SexComments](https://arxiv.org/pdf/2108.03070) datasets.

## Model Description

Our toxicity classifier employs a two-stage approach: we first extract multilingual document embeddings using [*XLM-RoBERTa*](https://huggingface.co/FacebookAI/xlm-roberta-base), then train a language-specific 2-layer MLP for binary toxicity classification on top of these embeddings for 6 epochs. The classifier checkpoints with the best accuracy on the held-out validation set are then used to annotate toxicity scores on FineWeb-2 and FineWeb. The accuracies on the held-out test set are listed below:

| Language | Accuracy |
|----------|----------|
| English (en) | 80.13% |
| Chinese (zh) | 79.64% |
| French (fr) | 82.34% |
| German (de) | 82.61% |
| Italian (it) | 82.16% |
| Dutch (nl) | 80.94% |
| Polish (pl) | 81.24% |
| Portuguese (pt) | 94.63% |
| Spanish (es) | 81.61% |

## Toxicity Scoring

An example of using the toxicity classifiers:

```python
import torch
import torch.nn as nn
from transformers import RobertaModel


# Define the model: an MLP classifier on top of a frozen XLM-RoBERTa encoder
class RobertaClassifier(nn.Module):
    def __init__(self, num_classes, model_name="FacebookAI/xlm-roberta-base", device="cuda:0"):
        super(RobertaClassifier, self).__init__()
        self.roberta = RobertaModel.from_pretrained(model_name)
        self.freeze_roberta_encoder()
        self.device = device
        self.classifier = nn.Sequential(
            nn.Linear(self.roberta.config.hidden_size, self.roberta.config.hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(self.roberta.config.hidden_size, num_classes)
        )

    def freeze_roberta_encoder(self):
        # The encoder stays frozen; only the MLP head is trained
        for param in self.roberta.parameters():
            param.requires_grad = False

    def mean_pooling(self, model_output, attention_mask):
        # https://huggingface.co/aditeyabaral/sentencetransformer-xlm-roberta-base
        token_embeddings = model_output.last_hidden_state  # all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def forward(self, input_ids=None, attention_mask=None, roberta_embeddings=None):
        # Alternative: pool with the CLS token instead of mean pooling
        # pooled_output = outputs.last_hidden_state[:, 0]
        if roberta_embeddings is None:
            outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
            roberta_embeddings = self.mean_pooling(outputs, attention_mask)
        logits = self.classifier(roberta_embeddings)
        return torch.nn.functional.softmax(logits, dim=1)

    def predict(self, input_ids=None, attention_mask=None, roberta_embeddings=None, **kwargs):
        """
        Predicts toxicity scores for a batch of tokenized inputs.

        Args:
            input_ids (torch.Tensor): Token ids produced by the tokenizer.
            attention_mask (torch.Tensor): Attention mask produced by the tokenizer.
            roberta_embeddings (torch.Tensor, optional): Precomputed document embeddings;
                if provided, the encoder forward pass is skipped.

        Returns:
            numpy.ndarray: Toxicity probability (class 1) for each input document.
        """
        self.eval()
        with torch.no_grad():
            if roberta_embeddings is None:
                probs = self(input_ids, attention_mask)
            else:
                probs = self(roberta_embeddings=roberta_embeddings)
        return probs[:, 1].cpu().numpy()
```
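The original training script is not included in this card. The following is a minimal sketch, consistent with the description above (frozen encoder, 2-layer MLP head, 6 epochs), of how such a head could be trained; the optimizer, learning rate, and the `train_loader` yielding tokenized batches with binary labels are assumptions, not the original configuration.

```python
# Minimal training sketch (assumptions: Adam optimizer, lr=1e-3, and a hypothetical
# `train_loader` yielding dicts with "input_ids", "attention_mask", and binary "labels").
import torch
import torch.nn as nn

device = "cuda:0"
model = RobertaClassifier(num_classes=2, device=device).to(device)
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)  # only the MLP head is trainable
criterion = nn.CrossEntropyLoss()

for epoch in range(6):  # 6 epochs, as in the Model Description
    model.train()
    for batch in train_loader:  # hypothetical DataLoader over a toxicity dataset
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Frozen encoder forward pass, then mean pooling into document embeddings
        with torch.no_grad():
            outputs = model.roberta(input_ids=input_ids, attention_mask=attention_mask)
            embeddings = model.mean_pooling(outputs, attention_mask)

        # Train only the MLP head with cross-entropy on its raw logits
        logits = model.classifier(embeddings)
        loss = criterion(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```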
""" self.eval() with torch.no_grad(): if roberta_embeddings is None: logits = self(input_ids, attention_mask) else: logits = self(roberta_embeddings=roberta_embeddings) return logits[:,1].cpu().numpy() ``` ```python MODEL_PATH = f"{MODEL_DIR}/english.pth" DEVICE = "cpu" model = RobertaClassifier(device=DEVICE, num_classes=2) model.load_state_dict(state_dict=torch.load(MODEL_PATH, map_location=torch.device(DEVICE))) tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base") document = ["I want to predict the toxicity score of this document: I am happy today.", "I want to predict the toxicity score of this document: this is a violent content!!"] inputs = tokenizer(document, return_tensors="pt", padding=True, truncation=True, max_length=512) model.predict(**inputs) # scores: [0.00121997, 0.9723031] ```