---
language:
- multilingual
license: apache-2.0
---

# EuroFilter-v1

## Running the Model

To run inference you must install the following dependencies:

```bash
pip install transformers[torch]
pip install datasets
pip install pandas
pip install tqdm
```
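
The `datasets` and `pandas` packages are not used directly in the inference snippet below, but they are handy for building the list of texts to classify. A minimal sketch, assuming your data lives in a CSV file or a Hugging Face dataset with a `text` column (the file name and dataset name are placeholders, not part of this model):

```python
import pandas as pd
from datasets import load_dataset

# Option 1: read the texts to classify from a local CSV file (placeholder file name).
df = pd.read_csv("my_documents.csv")
texts = df["text"].tolist()

# Option 2: stream a sample of texts from a Hugging Face dataset (placeholder dataset name).
dataset = load_dataset("my_org/my_dataset", split="train", streaming=True)
texts = [example["text"] for example, _ in zip(dataset, range(1000))]
```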

After installing those libraries you can run the following code:

```python
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from tqdm import tqdm


device = "cuda" if torch.cuda.is_available() else "cpu"
path = "Unbabel/mfineweb-edu-classifier"
model = AutoModelForSequenceClassification.from_pretrained(
    path, 
    device_map=device, 
    trust_remote_code=True, 
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

def get_model_outputs(texts):
    # Tokenize the batch and run it through the model without tracking gradients.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        # Regression head: raw educational score.
        score = outputs.logits
        # Classification head: probability that the text is of acceptable educational quality.
        prob = torch.sigmoid(outputs.binary_logits)
    return score.cpu(), prob.cpu()

def batchify_texts(texts, batch_size):
    # Yield successive batch_size-sized slices of the input list.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# TODO: replace the next line with the texts you want to classify
texts = LIST_WITH_TEXTS_TO_CLASSIFY
batch_size = 64  # Adjust based on your available memory and model capacity
num_batches = (len(texts) + batch_size - 1) // batch_size

all_scores = []
all_probs = []
with tqdm(total=num_batches, dynamic_ncols=True) as pbar:
    for batch_num, batch in enumerate(batchify_texts(texts, batch_size), 1):
        score, probs = get_model_outputs(batch)
        all_scores.append(score)
        all_probs.append(probs)
        pbar.set_description(f"Processing Batch {batch_num}/{num_batches}")
        pbar.update(1)

# scores is the output of the regression head and reflects the
# educational score of the text.
scores = torch.cat(all_scores, dim=0).squeeze()

# binary_pred is the output of the classification head and tells
# whether a text has an acceptable educational score or not.
# NOTE: converting the regression scores into binary predictions is also
# possible (see the snippet below this code block).
all_probs = torch.cat(all_probs, dim=0).squeeze()
binary_pred = (all_probs >= 0.5).numpy().astype(int)
```
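
As noted in the code above, the regression scores can also be thresholded directly to obtain binary predictions. A minimal sketch, assuming the cut-off of 3 used in the original FineWeb-Edu setup (the exact threshold is an assumption here and should be tuned to your own filtering needs):

```python
# Keep documents whose predicted educational score reaches the threshold.
threshold = 3.0  # assumed cut-off, following the FineWeb-Edu convention of score >= 3
binary_from_scores = (scores >= threshold).numpy().astype(int)
```

This alternative relies only on the regression head, whereas `binary_pred` above comes from the dedicated classification head.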

## English Results

When testing the model on an English partition with 37,537 samples, the results are comparable to those of the original FineWeb-Edu classifier.
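
The reports below follow scikit-learn's `classification_report` layout. A minimal sketch of how such numbers can be computed from the predictions above (requires `pip install scikit-learn`, and assumes hypothetical gold-label arrays `gold_scores` with integer 0-5 labels and `gold_binary` with 0/1 labels; rounding the regression output to integer labels is an assumption, not necessarily how the tables below were produced):

```python
from sklearn.metrics import classification_report

# Regression head: map predicted scores to integer labels in the 0-5 range by rounding.
pred_labels = scores.float().round().clamp(0, 5).numpy().astype(int)
print(classification_report(gold_scores, pred_labels))

# Binary head: compare the thresholded probabilities against the gold binary labels.
print(classification_report(gold_binary, binary_pred))
```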

Regression head results:

```
              precision    recall  f1-score   support

           0       0.80      0.53      0.64      5130
           1       0.80      0.88      0.83     21602
           2       0.63      0.58      0.61      7849
           3       0.54      0.62      0.58      2310
           4       0.62      0.48      0.54       645
           5       0.00      0.00      0.00         1

    accuracy                           0.74     37537
   macro avg       0.56      0.51      0.53     37537
weighted avg       0.74      0.74      0.74     37537
```

Binary head results:

```
              precision    recall  f1-score   support

           0       0.98      0.97      0.98     34581
           1       0.71      0.74      0.73      2956

    accuracy                           0.96     37537
   macro avg       0.85      0.86      0.85     37537
weighted avg       0.96      0.96      0.96     37537
```

## Multilingual Results

If we evaluate on the same texts translated into 15 different languages, the results are almost identical!

Regression head results:

```
              precision    recall  f1-score   support

           0       0.80      0.50      0.61      5130
           1       0.79      0.87      0.83     21602
           2       0.61      0.58      0.59      7849
           3       0.52      0.61      0.56      2310
           4       0.61      0.38      0.47       645
           5       0.00      0.00      0.00         1

    accuracy                           0.73     37537
   macro avg       0.55      0.49      0.51     37537
weighted avg       0.73      0.73      0.73     37537
```

Binary head results:

```
              precision    recall  f1-score   support

           0       0.98      0.97      0.97     34581
           1       0.70      0.71      0.71      2956

    accuracy                           0.95     37537
   macro avg       0.84      0.84      0.84     37537
weighted avg       0.95      0.95      0.95     37537
```