---
library_name: transformers
license: mit
base_model: BAAI/bge-small-en-v1.5
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: bge-small-en-v1.5-ultrafineweb-vs-pile-classifier
  results: []
datasets:
- openbmb/Ultra-FineWeb
- EleutherAI/the_pile_deduplicated
language:
- en
pipeline_tag: text-classification
---

# bge-small-en-v1.5-ultrafineweb-vs-pile-classifier

> [!IMPORTANT]
> **Note:** This model is provided for reference and reproducibility, not for standalone use.

This model is a fine-tuned version of [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) that classifies English text as high quality or low quality for use as AI training data.

- Trained on 100k samples from [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (labelled high quality) and 100k samples from [EleutherAI/the_pile_deduplicated](https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated) (labelled low quality)
- 80% training / 20% validation split
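
The labelling and split described above can be sketched in plain Python. This is only an illustration, not the author's preprocessing script: the helper name `build_labeled_split`, the label convention (1 = high quality, 0 = low quality), and the reuse of seed 42 for shuffling are all assumptions, and the tiny stand-in lists replace the 100k-sample corpora.

```python
import random


def build_labeled_split(high_quality_texts, low_quality_texts,
                        validation_fraction=0.2, seed=42):
    """Label two text collections and split them into train/validation sets."""
    # Assumed label convention: 1 = high quality, 0 = low quality.
    examples = [{"text": t, "label": 1} for t in high_quality_texts]
    examples += [{"text": t, "label": 0} for t in low_quality_texts]

    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(examples)

    n_validation = round(len(examples) * validation_fraction)
    return examples[n_validation:], examples[:n_validation]


# Tiny stand-in corpora; the model card uses 100k texts per source.
high = [f"high-quality document {i}" for i in range(50)]
low = [f"low-quality document {i}" for i in range(50)]

train, validation = build_labeled_split(high, low)
print(len(train), len(validation))  # 80 20
```

Shuffling before the split keeps both classes mixed in each partition; a stratified split would additionally guarantee an exact class balance.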
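
Results on the validation set (these figures match the epoch 1 row of the training results table, which has the lowest validation loss):

- Loss: 0.2926
- Accuracy: 0.9061
- Combined Score: 2.1448

Input tokens seen during training (all 5 epochs): 102,184,960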

## Example

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="agentlans/bge-small-en-v1.5-ultrafineweb-vs-pile-classifier")

# Returns a list with one {"label": ..., "score": ...} dict per input text
print(classifier("Your text here."))
```
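
When the classifier is run over many texts, a small summary of the outputs can be handy. The sketch below assumes the usual pipeline output shape (a list of `{"label", "score"}` dictionaries); the helper `summarize_predictions` is hypothetical, and the label strings and scores in `sample_predictions` are made-up placeholders, not actual output from this model.

```python
from collections import defaultdict


def summarize_predictions(predictions):
    """Count predictions per label and average their confidence scores."""
    counts = defaultdict(int)
    score_totals = defaultdict(float)
    for prediction in predictions:
        counts[prediction["label"]] += 1
        score_totals[prediction["label"]] += prediction["score"]
    return {
        label: {"count": counts[label],
                "mean_score": score_totals[label] / counts[label]}
        for label in counts
    }


# Placeholder data for illustration only (NOT real model output).
sample_predictions = [
    {"label": "LABEL_0", "score": 0.875},
    {"label": "LABEL_0", "score": 0.625},
    {"label": "LABEL_1", "score": 0.75},
]

summary = summarize_predictions(sample_predictions)
print(summary)
```

In practice, `predictions` would be the result of calling `classifier` on a list of strings.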

## Limitations

- Tends to be overly strict: it labels most text that does not resemble the training data as low quality
- English only
- May be biased against some text types, such as source code and personal blogs
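
One possible way to soften the strictness noted above is to apply a custom decision threshold to the high-quality score instead of taking the top label. The sketch below is an assumption-laden illustration: calling the pipeline with `top_k=None` should return a score for every label, but the label name `"LABEL_1"` for "high quality", the threshold of 0.3, and the example scores are all hypothetical and would need to be checked against the model's actual configuration.

```python
def is_high_quality(label_scores, high_label="LABEL_1", threshold=0.3):
    """Return True when the score for `high_label` reaches `threshold`.

    `label_scores` is a list of {"label", "score"} dicts, one per label.
    `high_label` is an ASSUMED label name; check the model config.
    """
    scores = {entry["label"]: entry["score"] for entry in label_scores}
    return scores[high_label] >= threshold


# Made-up scores for illustration only (NOT real model output).
example_scores = [
    {"label": "LABEL_0", "score": 0.625},
    {"label": "LABEL_1", "score": 0.375},
]

print(is_high_quality(example_scores))                 # True  (0.375 >= 0.3)
print(is_high_quality(example_scores, threshold=0.5))  # False (0.375 < 0.5)
```

A lower threshold keeps more borderline text at the cost of admitting more low-quality text, so any chosen value should be validated on a labelled sample from the target domain.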

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 5.0
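
To make the `linear` scheduler setting concrete, the sketch below computes the learning rate at each epoch boundary. It assumes zero warmup steps (no warmup is listed above) and takes the step counts from the training results table (19,958 steps per epoch, 99,790 in total); `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * (step / max(1, warmup_steps))
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


STEPS_PER_EPOCH = 19958  # from the training results table
TOTAL_STEPS = 99790      # 5 epochs x 19958 steps

for epoch in range(6):
    step = epoch * STEPS_PER_EPOCH
    print(f"end of epoch {epoch}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the learning rate falls by one fifth of its initial value per epoch and reaches zero at the final step.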

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Combined Score | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:--------------:|:-----------------:|
| 0.2893        | 1.0   | 19958 | 0.2926          | 0.9061   | 2.1448         | 20436992          |
| 0.2397        | 2.0   | 39916 | 0.3127          | 0.9076   | 2.1194         | 40873984          |
| 0.2           | 3.0   | 59874 | 0.3279          | 0.9109   | 2.0605         | 61310976          |
| 0.1576        | 4.0   | 79832 | 0.3887          | 0.9080   | 2.1119         | 81747968          |
| 0.1127        | 5.0   | 99790 | 0.4688          | 0.9069   | 2.1308         | 102184960         |
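
The table can be read programmatically to see which epoch scores best under each criterion. The short sketch below copies the validation figures from the table and picks the best epoch by validation loss and by accuracy; the variable names are illustrative only.

```python
# Validation figures copied from the training results table.
results = [
    {"epoch": 1, "val_loss": 0.2926, "accuracy": 0.9061, "tokens": 20436992},
    {"epoch": 2, "val_loss": 0.3127, "accuracy": 0.9076, "tokens": 40873984},
    {"epoch": 3, "val_loss": 0.3279, "accuracy": 0.9109, "tokens": 61310976},
    {"epoch": 4, "val_loss": 0.3887, "accuracy": 0.9080, "tokens": 81747968},
    {"epoch": 5, "val_loss": 0.4688, "accuracy": 0.9069, "tokens": 102184960},
]

best_by_loss = min(results, key=lambda row: row["val_loss"])
best_by_accuracy = max(results, key=lambda row: row["accuracy"])

print("lowest validation loss: epoch", best_by_loss["epoch"])   # epoch 1
print("highest accuracy: epoch", best_by_accuracy["epoch"])     # epoch 3

# Each epoch processes the same number of input tokens.
tokens_per_epoch = results[0]["tokens"]
print("tokens per epoch:", tokens_per_epoch)  # 20436992
```

Validation loss rises after epoch 1 while training loss keeps falling, which is a typical sign of overfitting; accuracy changes by less than half a percentage point across epochs.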

### Framework versions

- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0