---
license: apache-2.0
datasets:
- arbml/SANAD
language:
- ar
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
tags:
- modernbert
- arabic
---

# ModernBERT Arabic Model Card

## Overview

This is an Arabic version of ModernBERT, a modernized bidirectional encoder-only Transformer model (BERT-style). ModernBERT was pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. You can find more about the base ModernBERT model here: [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

For this proof of concept, a tokenizer trained on Arabic Wikipedia was used:

- **Dataset:** Arabic Wikipedia
- **Size:** 1.8 GB
- **Tokens:** 228,788,529 tokens

This model demonstrates how ModernBERT can be adapted to Arabic for tasks like topic classification.

## Model Details

- **Epochs:** 3
- **Evaluation Metrics:**
  - **F1 Score:** 0.9587811491105839
  - **Loss:** 0.19986020028591156
  - **Runtime:** 46.4942 seconds
  - **Samples per second:** 305.006
  - **Steps per second:** 38.134
- **Training Step:** 47,862

## How to Use

The model can be used for text classification with the `transformers` library. Below is an example:

```python
from transformers import pipeline

# Load the model. `model` accepts a Hub repository ID or a local
# checkpoint directory; the value below is the checkpoint path used here.
classifier = pipeline(
    task="text-classification",
    model="ModernBERT-domain-classifier/checkpoint-47862",
)

sample = '''
PUT SOME TEXT HERE TO CLASSIFY ITS TOPIC
'''

classifier(sample)
# [{'label': 'health', 'score': 0.6779336333274841}]
```
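
The classifier returns a list of `{'label', 'score'}` dictionaries, as in the sample output above. The snippet below is a minimal sketch of one way to post-process that list so that low-confidence predictions are rejected. The helper name `pick_label` and the `MIN_SCORE` cutoff of 0.5 are illustrative assumptions, not part of this model or of `transformers`; choose a threshold that suits your own data.

```python
from typing import Dict, List, Optional

# Hypothetical cutoff chosen for illustration only; it is not from the model card.
MIN_SCORE = 0.5


def pick_label(predictions: List[Dict], min_score: float = MIN_SCORE) -> Optional[str]:
    """Return the highest-scoring label, or None if its score is below min_score."""
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# Output copied from the usage example in this card.
example = [{"label": "health", "score": 0.6779336333274841}]

print(pick_label(example))       # health
print(pick_label(example, 0.9))  # None
```

Returning `None` instead of a fallback label keeps the "not confident enough" case explicit, so the calling code decides how to handle it.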