---
tags:
  - setfit
  - sentence-transformers
  - text-classification
  - generated_from_setfit_trainer
widget:
  - text: 03771 290230 oder 03771 2534030
  - text: >-
      Seit Jahresbeginn konnten wieder etliche Praxisbeispiele aus den
      hessischen Regionen für die OloV - Website aufbereitet oder mit aktuellen
      Entwicklungen ergänzt werden.
  - text: Greding     A9
  - text: 'Vortrag : SIMCP WORKSHOP, online ( eingeladen ) ; 16. 11.'
  - text: >-
      Nicht nur bei Fragen zur von Smart CROSSBLADE Dachboxen hilft unser neu
      gestalteter weiter.
pipeline_tag: text-classification
library_name: setfit
inference: false
license: mit
datasets:
  - mbley/german-webtext-quality-classification-dataset
language:
  - de
base_model:
  - distilbert/distilbert-base-multilingual-cased
---

Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning (RANLP 2025)

A multi-label sentence classifier trained with Active Learning that predicts high- or low-quality labels for German web text. Training and evaluation code: https://github.com/maximilian-bley/german-webtext-quality-classification

Model Details

  • Labels

  • 0=Sentence Boundary: Sentence boundary errors occur when the start or end of a sentence is malformed, i.e., the sentence begins with a lower-case letter or an atypical character, or lacks a proper terminal punctuation mark (e.g., period, exclamation mark, or question mark).

  • 1=Grammar Mistake: Grammar mistakes cover any grammatical errors, such as incorrect articles, cases, or word order, and the incorrect use or absence of words. Random-looking sequences of words, usually series of nouns, are also tagged. In most cases where this label applies, the sentence's comprehensibility or message is impaired.

  • 2=Spelling Anomaly: A spelling anomaly is tagged when a word does not conform to German spelling. This includes typos and incorrect capitalization (e.g., “all caps” or lower-case nouns). Spelling anomalies are irregularities that occur within the word boundary, here meaning the text between two whitespace characters. Individual letters and nonsensical word fragments are also tagged.

  • 3=Punctuation Error: Punctuation errors are tagged if a punctuation symbol has been placed incorrectly or is missing where it is intended. This includes comma errors, missing quotation marks or parentheses, periods instead of question marks, and incorrect or missing dashes or hyphens.

  • 4=Non-linguistic Content: Non-linguistic content includes all types of encoding errors, language-atypical occurrences of numbers and characters (e.g. random sequences of characters or letters), code (remnants), URLs, hashtags and emoticons.

  • 5=Letter Spacing: Letter spacings are deliberately inserted spaces between the characters of a word.

  • 6=Clean: Assigned if none of the other labels apply.
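For convenience, the label indices above can be kept in a small mapping and used to turn a multi-hot prediction vector into label names. This is a plain-Python illustration based on the list above, not code from the released repository:

```python
# Label index -> name mapping, taken directly from the label list above.
LABELS = {
    0: "Sentence Boundary",
    1: "Grammar Mistake",
    2: "Spelling Anomaly",
    3: "Punctuation Error",
    4: "Non-linguistic Content",
    5: "Letter Spacing",
    6: "Clean",
}

def decode(multi_hot):
    """Map a multi-hot prediction vector to the corresponding label names."""
    return [LABELS[i] for i, v in enumerate(multi_hot) if v]

# A sentence tagged with both a spelling anomaly and a punctuation error:
print(decode([0, 0, 1, 1, 0, 0, 0]))  # -> ['Spelling Anomaly', 'Punctuation Error']
```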

Model Description

  • Model Type: SetFit
  • Classification head: a SetFitHead instance
  • Maximum Sequence Length: 512 tokens
  • Number of Classes: 6
  • Language: German

Model Sources

  • Repository: https://github.com/maximilian-bley/german-webtext-quality-classification
  • Paper: Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning (RANLP 2025)

Uses

Direct Use for Inference

First install the SetFit library:

pip install setfit

Then you can load this model and run inference.

from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("setfit_model_id")
# Run inference
preds = model("在 Greding 出 口 离 开 A9 高 速 公 路 。")
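Since this is a multi-label classifier, it can be useful to work with per-label scores rather than hard 0/1 decisions; SetFit models expose `predict_proba` for this. Below is a minimal thresholding sketch over a fabricated score vector — the 0.5 cutoff and the example scores are assumptions for illustration, not values from this model card:

```python
# With the real model, per-label scores would come from:
#   probs = model.predict_proba(["ein Beispielsatz"])[0]
# Here a made-up score vector stands in, to show only the thresholding step.

def labels_above_threshold(probs, threshold=0.5):
    """Return the indices of labels whose score meets the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

scores = [0.91, 0.07, 0.62, 0.10, 0.02, 0.01, 0.03]  # fabricated example
print(labels_above_threshold(scores))  # -> [0, 2]
```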

Training Details

Training Hyperparameters

  • batch_size: (8, 8)
  • num_epochs: (1, 16)
  • max_steps: -1
  • sampling_strategy: oversampling
  • body_learning_rate: (2e-05, 1e-05)
  • head_learning_rate: 0.01
  • loss: CoSENTLoss
  • distance_metric: cosine_distance
  • margin: 0.25
  • end_to_end: True
  • use_amp: False
  • warmup_proportion: 0.1
  • l2_weight: 0.01
  • max_length: 512
  • seed: 13579
  • eval_max_steps: -1
  • load_best_model_at_end: False
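The hyperparameters above correspond to fields of SetFit's `TrainingArguments`. A hedged configuration sketch follows — the field names track the `setfit.TrainingArguments` API and should be checked against the installed SetFit version; the Trainer wiring is not part of this card:

```python
# Configuration sketch only: values copied from the hyperparameter list above.
from setfit import TrainingArguments
from sentence_transformers.losses import CoSENTLoss

args = TrainingArguments(
    batch_size=(8, 8),                # (embedding phase, classifier phase)
    num_epochs=(1, 16),
    sampling_strategy="oversampling",
    body_learning_rate=(2e-5, 1e-5),
    head_learning_rate=0.01,
    loss=CoSENTLoss,
    margin=0.25,
    end_to_end=True,
    use_amp=False,
    warmup_proportion=0.1,
    l2_weight=0.01,
    max_length=512,
    seed=13579,
    load_best_model_at_end=False,
)
```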

Training Results

Epoch Step Training Loss Validation Loss
0.0001 1 4.5018 -
0.0060 100 5.2045 -
0.0119 200 4.559 -
0.0179 300 3.4579 -
0.0239 400 3.106 -
0.0298 500 2.7464 -
0.0358 600 2.5813 -
0.0417 700 2.5341 -
0.0477 800 2.5279 -
0.0537 900 2.361 -
0.0596 1000 2.2318 -
0.0656 1100 1.8437 -
0.0716 1200 1.6423 -
0.0775 1300 1.7572 -
0.0835 1400 1.8163 -
0.0895 1500 1.4293 -
0.0954 1600 1.3842 -
0.1014 1700 0.9845 -
0.1073 1800 1.0666 -
0.1133 1900 0.6876 -
0.1193 2000 1.4398 -
0.1252 2100 0.7268 -
0.1312 2200 0.7272 -
0.1372 2300 0.9801 -
0.1431 2400 0.6159 -
0.1491 2500 0.465 -
0.1551 2600 1.0453 -
0.1610 2700 0.565 -
0.1670 2800 0.4328 -
0.1729 2900 0.5229 -
0.1789 3000 0.5581 -
0.1849 3100 0.1847 -
0.1908 3200 0.4755 -
0.1968 3300 0.8408 -
0.2028 3400 0.4852 -
0.2087 3500 0.6054 -
0.2147 3600 0.4868 -
0.2207 3700 0.4138 -
0.2266 3800 0.9303 -
0.2326 3900 0.3892 -
0.2385 4000 0.3462 -
0.2445 4100 0.3579 -
0.2505 4200 0.203 -
0.2564 4300 0.4673 -
0.2624 4400 0.1183 -
0.2684 4500 0.506 -
0.2743 4600 0.1378 -
0.2803 4700 0.1603 -
0.2863 4800 0.2337 -
0.2922 4900 0.1526 -
0.2982 5000 0.3597 -
0.3042 5100 0.0672 -
0.3101 5200 0.2134 -
0.3161 5300 0.3521 -
0.3220 5400 0.1098 -
0.3280 5500 0.0723 -
0.3340 5600 0.0349 -
0.3399 5700 0.1389 -
0.3459 5800 0.0966 -
0.3519 5900 0.0998 -
0.3578 6000 0.0263 -
0.3638 6100 0.2343 -
0.3698 6200 0.0776 -
0.3757 6300 0.0037 -
0.3817 6400 0.1324 -
0.3876 6500 0.1259 -
0.3936 6600 0.0197 -
0.3996 6700 0.048 -
0.4055 6800 0.077 -
0.4115 6900 0.025 -
0.4175 7000 0.1416 -
0.4234 7100 0.0622 -
0.4294 7200 0.0625 -
0.4354 7300 0.0281 -
0.4413 7400 0.0308 -
0.4473 7500 0.0675 -
0.4532 7600 0.0551 -
0.4592 7700 0.0174 -
0.4652 7800 0.0719 -
0.4711 7900 0.0426 -
0.4771 8000 0.0231 -
0.4831 8100 0.0253 -
0.4890 8200 0.0106 -
0.4950 8300 0.0199 -
0.5010 8400 0.0181 -
0.5069 8500 0.0136 -
0.5129 8600 0.0378 -
0.5188 8700 0.0151 -
0.5248 8800 0.002 -
0.5308 8900 0.0008 -
0.5367 9000 0.0025 -
0.5427 9100 0.0125 -
0.5487 9200 0.0112 -
0.5546 9300 0.0019 -
0.5606 9400 0.0265 -
0.5666 9500 0.017 -
0.5725 9600 0.0133 -
0.5785 9700 0.0324 -
0.5844 9800 0.0067 -
0.5904 9900 0.0032 -
0.5964 10000 0.0133 -
0.6023 10100 0.0014 -
0.6083 10200 0.0075 -
0.6143 10300 0.0142 -
0.6202 10400 0.0074 -
0.6262 10500 0.0446 -
0.6322 10600 0.0701 -
0.6381 10700 0.0039 -
0.6441 10800 0.0042 -
0.6500 10900 0.004 -
0.6560 11000 0.0009 -
0.6620 11100 0.0007 -
0.6679 11200 0.0012 -
0.6739 11300 0.0178 -
0.6799 11400 0.0024 -
0.6858 11500 0.0006 -
0.6918 11600 0.0011 -
0.6978 11700 0.0043 -
0.7037 11800 0.0013 -
0.7097 11900 0.0019 -
0.7156 12000 0.0025 -
0.7216 12100 0.0004 -
0.7276 12200 0.0065 -
0.7335 12300 0.001 -
0.7395 12400 0.0013 -
0.7455 12500 0.0036 -
0.7514 12600 0.0027 -
0.7574 12700 0.0015 -
0.7634 12800 0.0004 -
0.7693 12900 0.0102 -
0.7753 13000 0.0035 -
0.7812 13100 0.0003 -
0.7872 13200 0.0003 -
0.7932 13300 0.0001 -
0.7991 13400 0.0024 -
0.8051 13500 0.0009 -
0.8111 13600 0.0004 -
0.8170 13700 0.0002 -
0.8230 13800 0.0002 -
0.8290 13900 0.0005 -
0.8349 14000 0.0015 -
0.8409 14100 0.0035 -
0.8469 14200 0.0004 -
0.8528 14300 0.0003 -
0.8588 14400 0.0006 -
0.8647 14500 0.0002 -
0.8707 14600 0.0002 -
0.8767 14700 0.0004 -
0.8826 14800 0.0002 -
0.8886 14900 0.0004 -
0.8946 15000 0.0001 -
0.9005 15100 0.0004 -
0.9065 15200 0.0004 -
0.9125 15300 0.0003 -
0.9184 15400 0.0002 -
0.9244 15500 0.0001 -
0.9303 15600 0.0002 -
0.9363 15700 0.0004 -
0.9423 15800 0.0002 -
0.9482 15900 0.0004 -
0.9542 16000 0.0005 -
0.9602 16100 0.0002 -
0.9661 16200 0.0003 -
0.9721 16300 0.0001 -
0.9781 16400 0.0001 -
0.9840 16500 0.0002 -
0.9900 16600 0.0003 -
0.9959 16700 0.0005 -

Framework Versions

  • Python: 3.10.4
  • SetFit: 1.1.2
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.3
  • PyTorch: 2.7.0+cu126
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}