
Multilingual Toxicity Classifiers used in Apertus Pretraining

Author: Olivia Simin Fan (@Olivia-umich)

Language-specific toxicity classifiers in English, French, German, Italian, Spanish, Portuguese, Polish, Chinese, and Dutch, trained on the PleIAs/ToxicCommons and SWSR-SexComments datasets. We subsample non-toxic examples to create a balanced 50%-50% training set for each language, and set aside 10% of each balanced dataset as a balanced held-out validation set.
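As an illustration, a minimal sketch of this balancing and splitting step, assuming the raw annotations are loaded into a pandas DataFrame with text and binary label columns (the column names and the use of pandas are assumptions, not the exact preprocessing code):

import pandas as pd

def build_balanced_split(df: pd.DataFrame, seed: int = 0):
    """Subsample non-toxic rows to a 50%-50% balance, then hold out 10% for validation."""
    toxic = df[df["label"] == 1]
    non_toxic = df[df["label"] == 0].sample(n=len(toxic), random_state=seed)
    balanced = pd.concat([toxic, non_toxic]).sample(frac=1.0, random_state=seed)  # shuffle
    n_val = int(0.1 * len(balanced))
    return balanced.iloc[n_val:], balanced.iloc[:n_val]  # (train, validation)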

Model Description

Our toxicity classifier employs a two-stage approach: we first extract multilingual document embeddings with XLM-RoBERTa, then train a language-specific 2-layer MLP for binary toxicity classification on top of these embeddings for 6 epochs. The classifier checkpoints with the best accuracy on the held-out validation set are then used to annotate toxicity scores on FineWeb-2 and FineWeb.
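In practice this means only the 2-layer MLP head is trained on frozen, mean-pooled XLM-RoBERTa embeddings, and the checkpoint with the best validation accuracy is kept. A minimal sketch of such a training loop, using the RobertaClassifier defined in the scoring example below and assuming DataLoaders that yield precomputed embeddings and labels (the optimizer, learning rate, and batching are assumptions, not the exact training configuration):

import copy
import torch

def train_mlp_head(model, train_loader, val_loader, num_epochs=6, lr=1e-3, device="cuda:0"):
    # Only the MLP head has trainable parameters; the encoder is frozen
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    best_acc, best_state = 0.0, None
    model.to(device)
    for epoch in range(num_epochs):
        model.train()
        for embeddings, labels in train_loader:
            optimizer.zero_grad()
            probs = model(roberta_embeddings=embeddings.to(device))
            # forward() returns softmax probabilities, so take NLL on their log
            loss = torch.nn.functional.nll_loss(torch.log(probs.clamp_min(1e-9)), labels.to(device))
            loss.backward()
            optimizer.step()
        # Evaluate on the balanced held-out validation set and keep the best checkpoint
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for embeddings, labels in val_loader:
                probs = model(roberta_embeddings=embeddings.to(device))
                correct += (probs.argmax(dim=1).cpu() == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    return best_state, best_acc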

The accuracies on the balanced held-out validation set are listed below:

| Language        | Accuracy |
|-----------------|----------|
| English (en)    | 80.13%   |
| Chinese (zh)    | 79.64%   |
| French (fr)     | 82.34%   |
| German (de)     | 82.61%   |
| Italian (it)    | 82.16%   |
| Dutch (nl)      | 80.94%   |
| Polish (pl)     | 81.24%   |
| Portuguese (pt) | 94.63%   |
| Spanish (es)    | 81.61%   |

Toxicity Scoring

An example of how to use the toxicity classifiers:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, RobertaModel

# Define the model with an MLP classifier on top of XLM-RoBERTa
class RobertaClassifier(nn.Module):
    def __init__(self, num_classes,
                 model_name="FacebookAI/xlm-roberta-base",
                 device="cuda:0"):
        super(RobertaClassifier, self).__init__()
        self.roberta = RobertaModel.from_pretrained(model_name)
        self.freeze_roberta_encoder()
        self.device = device
        self.classifier = nn.Sequential(
            nn.Linear(self.roberta.config.hidden_size, self.roberta.config.hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(self.roberta.config.hidden_size, num_classes)
        )

    def freeze_roberta_encoder(self):
        for param in self.roberta.parameters():
            param.requires_grad = False

    def mean_pooling(self, model_output, attention_mask):
        # Mean-pool token embeddings, ignoring padding positions
        # (following https://huggingface.co/aditeyabaral/sentencetransformer-xlm-roberta-base)
        token_embeddings = model_output.last_hidden_state  # all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size())
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def forward(self, input_ids=None, attention_mask=None,
                roberta_embeddings=None):
        if roberta_embeddings is None:
            outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
            roberta_embeddings = self.mean_pooling(outputs, attention_mask)
        logits = self.classifier(roberta_embeddings)
        return torch.nn.functional.softmax(logits, dim=1)

    def predict(self, input_ids=None, attention_mask=None,
                roberta_embeddings=None, **kwargs):
        """
        Predicts toxicity scores for a batch of tokenized inputs.

        Args:
            input_ids, attention_mask: Tokenizer outputs for the input texts.
            roberta_embeddings: Optional precomputed XLM-RoBERTa embeddings,
                used instead of input_ids/attention_mask when provided.

        Returns:
            numpy.ndarray: Probability of the toxic class for each input.
        """
        self.eval()

        with torch.no_grad():
            if roberta_embeddings is None:
                probs = self(input_ids, attention_mask)
            else:
                probs = self(roberta_embeddings=roberta_embeddings)
        return probs[:, 1].cpu().numpy()
LANGUAGE = "english" # choose from ["english", "chinese", "french", "german", "italian", "spanish", "portuguese", "polish", "dutch"]
MODEL_PATH = f"{MODEL_DIR}/{LANGUAGE}.pth"
DEVICE = "cpu"

model = RobertaClassifier(device=DEVICE, num_classes=2)
model.load_state_dict(state_dict=torch.load(MODEL_PATH, map_location=torch.device(DEVICE)))
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

document = ["I want to predict the toxicity score of this document: I am happy today.",
            "I want to predict the toxicity score of this document: this is a violent content!!"]

inputs = tokenizer(document, return_tensors="pt", padding=True, truncation=True, max_length=512)
model.predict(**inputs) # scores: [0.00121997, 0.9723031]
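
The same predict call can be used to annotate a larger corpus in batches. A minimal sketch, assuming a Hugging Face datasets corpus with a text column (the dataset name, config, and column are illustrative placeholders, not the exact annotation pipeline used for FineWeb and FineWeb-2):

from datasets import load_dataset

def add_toxicity_scores(batch):
    # Tokenize a batch of documents and attach one toxicity score per document
    enc = tokenizer(batch["text"], return_tensors="pt", padding=True,
                    truncation=True, max_length=512)
    batch["toxicity_score"] = model.predict(**enc).tolist()
    return batch

# Illustrative corpus; replace with the FineWeb / FineWeb-2 shard to annotate
corpus = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn", split="train", streaming=True)
corpus = corpus.map(add_toxicity_scores, batched=True, batch_size=32)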