---
license: apache-2.0
datasets:
  - deepvk/synthetic-classes
language:
  - ru
base_model:
  - deepvk/USER2-base
pipeline_tag: zero-shot-classification
---

# GeRaCl-USER2-base

GeRaCl is a General Rapid Classifier designed to perform zero-shot classification, primarily on Russian texts.

This is a 155M-parameter model built on top of the USER2-base sentence encoder (149M) and fine-tuned for the zero-shot classification task.

## Performance

To evaluate the model, we measure its quality on multiclass classification tasks from the MTEB-rus benchmark.

### MTEB-rus

| Model | Size | Hidden Dim | Context Length | Mean(task) | Kinopoisk | Headlines | GRNTI | OECD | Inappropriateness |
|---|---|---|---|---|---|---|---|---|---|
| GeRaCl-USER2-base | 155M | 768 | 8192 | 0.65 | 0.61 | 0.80 | 0.63 | 0.48 | 0.71 |
| USER2-base | 149M | 768 | 8192 | 0.52 | 0.50 | 0.65 | 0.56 | 0.39 | 0.51 |
| USER-bge-m3 | 359M | 1024 | 8192 | 0.53 | 0.60 | 0.73 | 0.43 | 0.28 | 0.62 |
| multilingual-e5-large-instruct | 560M | 1024 | 512 | 0.63 | 0.56 | 0.83 | 0.62 | 0.46 | 0.67 |
| mDeBERTa-v3-base-mnli-xnli | 279M | 768 | 512 | 0.45 | 0.54 | 0.53 | 0.34 | 0.23 | 0.62 |
| bge-m3-zeroshot-v2.0 | 568M | 1024 | 8192 | 0.60 | 0.65 | 0.72 | 0.53 | 0.41 | 0.67 |
| Qwen2.5-1.5B-Instruct | 1.5B | 1536 | 128K | 0.56 | 0.62 | 0.55 | 0.51 | 0.41 | 0.71 |
| Qwen2.5-3B-Instruct | 3B | 2048 | 128K | 0.63 | 0.63 | 0.74 | 0.60 | 0.43 | 0.75 |

## Usage

### Prefixes

This model is based on the USER2-base sentence encoder and uses the `"classification: "` prefix for classification tasks.
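For reference, here is a minimal sketch of prepending this prefix when encoding text with the underlying USER2-base encoder directly through sentence-transformers. Whether the `geracl` pipeline adds the prefix for you automatically is not stated here, so treat that as an assumption to verify against the library's documentation.

```python
# Sketch only: manual prefixing with the underlying USER2-base encoder.
# Assumes the standard sentence-transformers API; the geracl pipeline's own
# prefix handling may make this step unnecessary.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("deepvk/USER2-base")

text = "Утилизация катализаторов: как неплохо заработать"

# Prepend the task prefix before encoding.
embedding = encoder.encode("classification: " + text)
print(embedding.shape)
```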

### Code

#### Single classification scenario

```python
from transformers import AutoTokenizer
from geracl import GeraclHF, ZeroShotClassificationPipeline

model = GeraclHF.from_pretrained('deepvk/GeRaCl-USER2-base').to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('deepvk/GeRaCl-USER2-base')

pipe = ZeroShotClassificationPipeline(model, tokenizer, device="cuda")

text = "Утилизация катализаторов: как неплохо заработать"
labels = ["экономика", "происшествия", "политика", "культура", "наука", "спорт"]

# The pipeline returns the index of the predicted label for each input text.
result = pipe(text, labels, batch_size=1)[0]

print(labels[result])
```

#### Multiple classification scenarios

```python
from transformers import AutoTokenizer
from geracl import GeraclHF, ZeroShotClassificationPipeline

model = GeraclHF.from_pretrained('deepvk/GeRaCl-USER2-base').to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('deepvk/GeRaCl-USER2-base')

pipe = ZeroShotClassificationPipeline(model, tokenizer, device="cuda")

texts = [
    "Утилизация катализаторов: как неплохо заработать",
    "Мне не понравился этот фильм"
]
# Each text comes with its own list of candidate labels.
labels = [
    ["экономика", "происшествия", "политика", "культура", "наука", "спорт"],
    ["нейтральный", "позитивный", "негативный"]
]
results = pipe(texts, labels, batch_size=2)

# For every text, print the label at the predicted index.
for label_list, result in zip(labels, results):
    print(label_list[result])
```

## Training details

This is the base version with 155 million parameters, built on top of the USER2-base sentence encoder. The model follows the GLiNER architecture but produces a single vector of similarity scores instead of a full matrix of similarities. Compared to USER2-base, it adds two MLP layers: one for the text embeddings and one for the class embeddings. The detailed architecture is shown in the figure below.

GeRaCl architecture
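To make the description above concrete, here is a simplified, hypothetical sketch of the scoring head: the text embedding and the class embeddings pass through separate MLPs, and a single vector with one similarity score per candidate class is produced. Layer sizes and module structure are illustrative assumptions, not the actual GeRaCl implementation.

```python
# Hypothetical sketch of the GeRaCl-style scoring head described above.
# Dimensions and layer structure are assumptions for illustration only.
import torch
import torch.nn as nn


class ScoringHeadSketch(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Separate MLPs for the text embedding and the class embeddings.
        self.text_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.class_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, text_emb: torch.Tensor, class_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (hidden_dim,), class_embs: (num_classes, hidden_dim)
        t = self.text_mlp(text_emb)      # (hidden_dim,)
        c = self.class_mlp(class_embs)   # (num_classes, hidden_dim)
        # One similarity score per class: a vector, not a full similarity matrix.
        return c @ t                     # (num_classes,)


head = ScoringHeadSketch()
scores = head(torch.randn(768), torch.randn(6, 768))
print(scores.argmax().item())  # index of the highest-scoring class
```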

The training set is built entirely from splits of the deepvk/CLAZER dataset. It is a concatenation of three sub-datasets:

- Synthetic classes part. For every training example we randomly chose one of the five class lists (`classes_0` through `classes_4`) and paired it with the sample's text (see the sketch after this list). The validation and test splits were added unchanged.
- RU-MTEB part. The entire `ru_mteb_classes` dataset was added to the mix.
- RU-MTEB extended part. The entire `ru_mteb_extended_classes` dataset was added to the mix.
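Below is a hedged sketch of the synthetic-classes sampling described in the first item. The field names (`text`, `classes_0` … `classes_4`) are assumptions about the deepvk/CLAZER schema made for illustration.

```python
# Sketch only: pick one of the five class lists at random and pair it with the text.
# Field names are assumed, not taken from the CLAZER documentation.
import random


def sample_synthetic_example(example: dict) -> dict:
    class_field = random.choice([f"classes_{i}" for i in range(5)])
    return {"text": example["text"], "classes": example[class_field]}


# Dummy record shaped like the assumed schema.
dummy = {"text": "...", **{f"classes_{i}": [f"label_{i}a", f"label_{i}b"] for i in range(5)}}
print(sample_synthetic_example(dummy))
```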

## Citations

```bibtex
@misc{deepvk2025geracl,
    title={GeRaCl},
    author={Vyrodov, Mikhail and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/GeRaCl-USER2-base},
    publisher={Hugging Face},
    year={2025},
}
```