---
library_name: transformers
tags: []
---

# 🎯 ClassiCC-PT Classifiers

## 📖 Overview

The ClassiCC-PT classifiers are three BERTimbau-based neural classifiers for Portuguese web documents, trained on GPT-4o–annotated data. They were created to support content-based filtering of large-scale Portuguese corpora and are part of the ClassiCC-PT dataset pipeline.

**This repository contains the STEM classifier.**

The classifiers provide document-level scores (0–5) for:

- Educational Content (ClassiCC-PT-edu)
- STEM Content (ClassiCC-PT-STEM)
- Toxic Content (ClassiCC-PT-toxic)

## 🏗 Training Setup

- Base model: BERTimbau Base
- Head: Linear regression layer
- Objective: Predict discrete scores (0–5) assigned by GPT-4o
- Optimizer: AdamW (lr = 3e-4)
- Scheduler: Cosine decay with 5% warmup
- Epochs: 20
- Training hardware: A100 GPUs

## 📊 Performance

All classifiers are evaluated both as regressors and in binary classification mode (score ≥ 3 → positive).

| Classifier | Task | Test Size | Train Size | F1 (Binary) |
| ----------------- | ----------------------- | --------- | ---------- | ----------- |
| ClassiCC-PT-edu | Educational Content | 10k | 110k | **0.77** |
| ClassiCC-PT-STEM | STEM Content | 12k | 100k | **0.76** |
| ClassiCC-PT-toxic | Toxic/Offensive Content | 20k | 180k | **0.78** |

For comparison, the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) (trained on English data only) achieved only 0.48 F1 on Portuguese data, highlighting the need for language-specific models.

## 💡 Intended Use

These classifiers were built for pretraining corpus filtering but can also be used for:

- Dataset annotation for educational/STEM/toxic content
- Research on Portuguese NLP content classification
- Filtering user-generated content in applications targeting Portuguese speakers

## Usage

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Pick the classifier you need: ClassiCC-PT-edu, ClassiCC-PT-STEM, or ClassiCC-PT-toxic.
model_name = "ClassiCC-Corpus/ClassiCC-PT-edu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The model is a regressor: a single logit holds the 0–5 score.
score = outputs.logits.squeeze(-1).float().item()
print(f"Score: {score:.2f}")
```

## 📜 Citation

If you use these classifiers, please cite:

```
coming soon
```
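
## 🔎 Binary Filtering Sketch

As described in the Performance section, binary labels are obtained by thresholding the regression score at 3. The snippet below is a minimal sketch of corpus filtering with that rule; the `texts` list is an illustrative placeholder, and the model identifier simply reuses the one from the usage example above.

```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same model id as the usage example; swap in the STEM or toxic classifier as needed.
model_name = "ClassiCC-Corpus/ClassiCC-PT-edu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Placeholder documents standing in for a Portuguese web corpus.
texts = [
    "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química.",
    "Clique aqui para ganhar um prêmio incrível agora mesmo!",
]

inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)

# Binary mode used in the evaluation: score >= 3 -> positive (kept).
kept = [text for text, score in zip(texts, scores.tolist()) if score >= 3]
print(kept)
```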