---
library_name: transformers
tags: []
---

# 🎯 ClassiCC-PT Classifiers

## 📖 Overview

The ClassiCC-PT classifiers are three BERTimbau-based neural classifiers for Portuguese web documents, trained on GPT-4o–annotated data. They were created to support content-based filtering of large-scale Portuguese corpora and are part of the ClassiCC-PT dataset pipeline.

**This repository contains the STEM classifier.**

The classifiers provide document-level scores (0–5) for:

- Educational Content (ClassiCC-PT-edu)
- STEM Content (ClassiCC-PT-STEM)
- Toxic Content (ClassiCC-PT-toxic)

## 🏗 Training Setup

- Base model: BERTimbau Base
- Head: Linear regression layer
- Objective: Predict discrete scores (0–5) assigned by GPT-4o
- Optimizer: AdamW (lr = 3e-4)
- Scheduler: Cosine decay with 5% warmup
- Epochs: 20
- Training hardware: A100 GPUs

## 📊 Performance

All classifiers are evaluated both as regressors and in binary classification mode (score ≥ 3 → positive).

| Classifier | Task | Test Size | Train Size | F1 (Binary) |
| ----------------- | ----------------------- | --------- | ---------- | ----------- |
| ClassiCC-PT-edu | Educational Content | 10k | 110k | **0.77** |
| ClassiCC-PT-STEM | STEM Content | 12k | 100k | **0.76** |
| ClassiCC-PT-toxic | Toxic/Offensive Content | 20k | 180k | **0.78** |

For comparison, the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) (trained on English data only) achieved only 0.48 F1 on Portuguese data, highlighting the need for language-specific models.

## 💡 Intended Use

These classifiers were built for pretraining corpus filtering but can also be used for:

- Dataset annotation for educational/STEM/toxic content
- Research on Portuguese NLP content classification
- Filtering user-generated content in applications targeting Portuguese speakers

## Usage

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Pick the classifier you need: ClassiCC-PT-edu, ClassiCC-PT-STEM, or ClassiCC-PT-toxic.
model_name = "ClassiCC-Corpus/ClassiCC-PT-edu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The model is a regressor: a single logit holds the 0–5 score.
score = outputs.logits.squeeze(-1).float().item()
print(f"Score: {score:.2f}")
```

## 📜 Citation

If you use these classifiers, please cite:

```
coming soon
```
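
## 🔎 Binary Filtering Sketch

As described in the Performance section, binary labels are obtained by thresholding the regression score at 3. The snippet below is a minimal sketch of corpus filtering with that rule; the `texts` list is an illustrative placeholder, and the model identifier simply reuses the one from the usage example above.

```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same model id as the usage example; swap in the STEM or toxic classifier as needed.
model_name = "ClassiCC-Corpus/ClassiCC-PT-edu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Placeholder documents standing in for a Portuguese web corpus.
texts = [
    "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química.",
    "Clique aqui para ganhar um prêmio incrível agora mesmo!",
]

inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)

# Binary mode used in the evaluation: score >= 3 -> positive (kept).
kept = [text for text, score in zip(texts, scores.tolist()) if score >= 3]
print(kept)
```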