---
tags:
  - setfit
  - sentence-transformers
  - text-classification
  - generated_from_setfit_trainer
widget:
  - text: 03771 290230 oder 03771 2534030
  - text: >-
      Seit Jahresbeginn konnten wieder etliche Praxisbeispiele aus den
      hessischen Regionen für die OloV - Website aufbereitet oder mit aktuellen
      Entwicklungen ergänzt werden.
  - text: Greding     A9
  - text: 'Vortrag : SIMCP WORKSHOP, online ( eingeladen ) ; 16. 11.'
  - text: >-
      Nicht nur bei Fragen zur von Smart CROSSBLADE Dachboxen hilft unser neu
      gestalteter weiter.
pipeline_tag: text-classification
library_name: setfit
inference: false
license: mit
datasets:
  - mbley/german-webtext-quality-classification-dataset
language:
  - de
base_model:
  - distilbert/distilbert-base-multilingual-cased
---

Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning (RANLP 2025)

A multi-label sentence classifier trained with Active Learning that predicts high- or low-quality labels for German web text. Training and evaluation code: https://github.com/maximilian-bley/german-webtext-quality-classification

Model Details

  • Labels

  • 0=Sentence Boundary: Sentence boundary errors occur when the start or end of a sentence is malformed, i.e., the sentence begins with a lower-case letter or an atypical character, or lacks a proper terminal punctuation mark (e.g., period, exclamation mark, or question mark).

  • 1=Grammar Mistake: Grammar mistakes cover any grammatical errors, such as incorrect articles, cases, or word order, and the incorrect use or absence of words. Random-looking sequences of words, usually series of nouns, are also tagged. In most cases where this label applies, the sentence's comprehensibility or message is impaired.

  • 2=Spelling Anomaly: A spelling anomaly is tagged when a word does not conform to German spelling. This includes typos and incorrect capitalization (e.g., “all caps” or lower-case nouns). Spelling anomalies are irregularities that occur within the word boundary, here meaning the text between two whitespace characters. Individual letters and nonsensical word fragments are also tagged.

  • 3=Punctuation Error: Punctuation errors are tagged if a punctuation symbol has been placed incorrectly or is missing where it is intended. This includes comma errors, missing quotation marks or parentheses, periods instead of question marks, and incorrect or missing dashes or hyphens.

  • 4=Non-linguistic Content: Non-linguistic content includes all types of encoding errors, language-atypical occurrences of numbers and characters (e.g. random sequences of characters or letters), code (remnants), URLs, hashtags and emoticons.

  • 5=Letter Spacing: Letter spacings are deliberately inserted spaces between the characters of a word.

  • 6=Clean: Assigned if none of the other labels apply.
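For convenience, the label indices above can be kept in a small mapping and used to turn a multi-hot prediction vector into label names. This is a plain-Python illustration based on the list above, not code from the released repository:

```python
# Label index -> name mapping, taken directly from the label list above.
LABELS = {
    0: "Sentence Boundary",
    1: "Grammar Mistake",
    2: "Spelling Anomaly",
    3: "Punctuation Error",
    4: "Non-linguistic Content",
    5: "Letter Spacing",
    6: "Clean",
}

def decode(multi_hot):
    """Map a multi-hot prediction vector to the corresponding label names."""
    return [LABELS[i] for i, v in enumerate(multi_hot) if v]

# A sentence tagged with both a spelling anomaly and a punctuation error:
print(decode([0, 0, 1, 1, 0, 0, 0]))  # -> ['Spelling Anomaly', 'Punctuation Error']
```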

Model Description

  • Model Type: SetFit
  • Classification head: a SetFitHead instance
  • Maximum Sequence Length: 512 tokens
  • Number of Classes: 6
  • Language: German

Model Sources

  • Repository: https://github.com/maximilian-bley/german-webtext-quality-classification
  • Paper: Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning (RANLP 2025)

Uses

Direct Use for Inference

First install the SetFit library:

pip install setfit

Then you can load this model and run inference.

from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("setfit_model_id")
# Run inference
preds = model("在 Greding 出 口 离 开 A9 高 速 公 路 。")
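Since this is a multi-label classifier, it can be useful to work with per-label scores rather than hard 0/1 decisions; SetFit models expose `predict_proba` for this. Below is a minimal thresholding sketch over a fabricated score vector — the 0.5 cutoff and the example scores are assumptions for illustration, not values from this model card:

```python
# With the real model, per-label scores would come from:
#   probs = model.predict_proba(["ein Beispielsatz"])[0]
# Here a made-up score vector stands in, to show only the thresholding step.

def labels_above_threshold(probs, threshold=0.5):
    """Return the indices of labels whose score meets the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

scores = [0.91, 0.07, 0.62, 0.10, 0.02, 0.01, 0.03]  # fabricated example
print(labels_above_threshold(scores))  # -> [0, 2]
```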

Training Details

Training Hyperparameters

  • batch_size: (8, 8)
  • num_epochs: (1, 16)
  • max_steps: -1
  • sampling_strategy: oversampling
  • body_learning_rate: (2e-05, 1e-05)
  • head_learning_rate: 0.01
  • loss: CoSENTLoss
  • distance_metric: cosine_distance
  • margin: 0.25
  • end_to_end: True
  • use_amp: False
  • warmup_proportion: 0.1
  • l2_weight: 0.01
  • max_length: 512
  • seed: 13579
  • eval_max_steps: -1
  • load_best_model_at_end: False
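The hyperparameters above correspond to fields of SetFit's `TrainingArguments`. A hedged configuration sketch follows — the field names track the `setfit.TrainingArguments` API and should be checked against the installed SetFit version; the Trainer wiring is not part of this card:

```python
# Configuration sketch only: values copied from the hyperparameter list above.
from setfit import TrainingArguments
from sentence_transformers.losses import CoSENTLoss

args = TrainingArguments(
    batch_size=(8, 8),                # (embedding phase, classifier phase)
    num_epochs=(1, 16),
    sampling_strategy="oversampling",
    body_learning_rate=(2e-5, 1e-5),
    head_learning_rate=0.01,
    loss=CoSENTLoss,
    margin=0.25,
    end_to_end=True,
    use_amp=False,
    warmup_proportion=0.1,
    l2_weight=0.01,
    max_length=512,
    seed=13579,
    load_best_model_at_end=False,
)
```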

Training Results

Epoch Step Training Loss Validation Loss
0.0001 1 4.5018 -
0.0060 100 5.2045 -
0.0119 200 4.559 -
0.0179 300 3.4579 -
0.0239 400 3.106 -
0.0298 500 2.7464 -
0.0358 600 2.5813 -
0.0417 700 2.5341 -
0.0477 800 2.5279 -
0.0537 900 2.361 -
0.0596 1000 2.2318 -
0.0656 1100 1.8437 -
0.0716 1200 1.6423 -
0.0775 1300 1.7572 -
0.0835 1400 1.8163 -
0.0895 1500 1.4293 -
0.0954 1600 1.3842 -
0.1014 1700 0.9845 -
0.1073 1800 1.0666 -
0.1133 1900 0.6876 -
0.1193 2000 1.4398 -
0.1252 2100 0.7268 -
0.1312 2200 0.7272 -
0.1372 2300 0.9801 -
0.1431 2400 0.6159 -
0.1491 2500 0.465 -
0.1551 2600 1.0453 -
0.1610 2700 0.565 -
0.1670 2800 0.4328 -
0.1729 2900 0.5229 -
0.1789 3000 0.5581 -
0.1849 3100 0.1847 -
0.1908 3200 0.4755 -
0.1968 3300 0.8408 -
0.2028 3400 0.4852 -
0.2087 3500 0.6054 -
0.2147 3600 0.4868 -
0.2207 3700 0.4138 -
0.2266 3800 0.9303 -
0.2326 3900 0.3892 -
0.2385 4000 0.3462 -
0.2445 4100 0.3579 -
0.2505 4200 0.203 -
0.2564 4300 0.4673 -
0.2624 4400 0.1183 -
0.2684 4500 0.506 -
0.2743 4600 0.1378 -
0.2803 4700 0.1603 -
0.2863 4800 0.2337 -
0.2922 4900 0.1526 -
0.2982 5000 0.3597 -
0.3042 5100 0.0672 -
0.3101 5200 0.2134 -
0.3161 5300 0.3521 -
0.3220 5400 0.1098 -
0.3280 5500 0.0723 -
0.3340 5600 0.0349 -
0.3399 5700 0.1389 -
0.3459 5800 0.0966 -
0.3519 5900 0.0998 -
0.3578 6000 0.0263 -
0.3638 6100 0.2343 -
0.3698 6200 0.0776 -
0.3757 6300 0.0037 -
0.3817 6400 0.1324 -
0.3876 6500 0.1259 -
0.3936 6600 0.0197 -
0.3996 6700 0.048 -
0.4055 6800 0.077 -
0.4115 6900 0.025 -
0.4175 7000 0.1416 -
0.4234 7100 0.0622 -
0.4294 7200 0.0625 -
0.4354 7300 0.0281 -
0.4413 7400 0.0308 -
0.4473 7500 0.0675 -
0.4532 7600 0.0551 -
0.4592 7700 0.0174 -
0.4652 7800 0.0719 -
0.4711 7900 0.0426 -
0.4771 8000 0.0231 -
0.4831 8100 0.0253 -
0.4890 8200 0.0106 -
0.4950 8300 0.0199 -
0.5010 8400 0.0181 -
0.5069 8500 0.0136 -
0.5129 8600 0.0378 -
0.5188 8700 0.0151 -
0.5248 8800 0.002 -
0.5308 8900 0.0008 -
0.5367 9000 0.0025 -
0.5427 9100 0.0125 -
0.5487 9200 0.0112 -
0.5546 9300 0.0019 -
0.5606 9400 0.0265 -
0.5666 9500 0.017 -
0.5725 9600 0.0133 -
0.5785 9700 0.0324 -
0.5844 9800 0.0067 -
0.5904 9900 0.0032 -
0.5964 10000 0.0133 -
0.6023 10100 0.0014 -
0.6083 10200 0.0075 -
0.6143 10300 0.0142 -
0.6202 10400 0.0074 -
0.6262 10500 0.0446 -
0.6322 10600 0.0701 -
0.6381 10700 0.0039 -
0.6441 10800 0.0042 -
0.6500 10900 0.004 -
0.6560 11000 0.0009 -
0.6620 11100 0.0007 -
0.6679 11200 0.0012 -
0.6739 11300 0.0178 -
0.6799 11400 0.0024 -
0.6858 11500 0.0006 -
0.6918 11600 0.0011 -
0.6978 11700 0.0043 -
0.7037 11800 0.0013 -
0.7097 11900 0.0019 -
0.7156 12000 0.0025 -
0.7216 12100 0.0004 -
0.7276 12200 0.0065 -
0.7335 12300 0.001 -
0.7395 12400 0.0013 -
0.7455 12500 0.0036 -
0.7514 12600 0.0027 -
0.7574 12700 0.0015 -
0.7634 12800 0.0004 -
0.7693 12900 0.0102 -
0.7753 13000 0.0035 -
0.7812 13100 0.0003 -
0.7872 13200 0.0003 -
0.7932 13300 0.0001 -
0.7991 13400 0.0024 -
0.8051 13500 0.0009 -
0.8111 13600 0.0004 -
0.8170 13700 0.0002 -
0.8230 13800 0.0002 -
0.8290 13900 0.0005 -
0.8349 14000 0.0015 -
0.8409 14100 0.0035 -
0.8469 14200 0.0004 -
0.8528 14300 0.0003 -
0.8588 14400 0.0006 -
0.8647 14500 0.0002 -
0.8707 14600 0.0002 -
0.8767 14700 0.0004 -
0.8826 14800 0.0002 -
0.8886 14900 0.0004 -
0.8946 15000 0.0001 -
0.9005 15100 0.0004 -
0.9065 15200 0.0004 -
0.9125 15300 0.0003 -
0.9184 15400 0.0002 -
0.9244 15500 0.0001 -
0.9303 15600 0.0002 -
0.9363 15700 0.0004 -
0.9423 15800 0.0002 -
0.9482 15900 0.0004 -
0.9542 16000 0.0005 -
0.9602 16100 0.0002 -
0.9661 16200 0.0003 -
0.9721 16300 0.0001 -
0.9781 16400 0.0001 -
0.9840 16500 0.0002 -
0.9900 16600 0.0003 -
0.9959 16700 0.0005 -

Framework Versions

  • Python: 3.10.4
  • SetFit: 1.1.2
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.3
  • PyTorch: 2.7.0+cu126
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}