Model Card for custom-bpe-text-tokenizer

Model Details

Model Description

This is the model card of a 🤗 transformers-compatible tokenizer that has been pushed to the Hub. This model card has been automatically generated.

  • Developed by: Sameer Paymode
  • Model type: BPE tokenizer
  • Language(s) (NLP): English
  • License: No License Needed

Get Started with the Model

This model is a Byte Pair Encoding (BPE) tokenizer trained on the Salesforce/wikitext dataset.

Use the following code to load and try out the tokenizer:

from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

# Encode a sentence into token IDs, then decode the IDs back into text
text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print("Token IDs:", ids)
print("Decoded Text:", decoded)

Uses

  • Preprocessing Text for Language Models

    • The tokenizer transforms raw text into input IDs (token IDs) that a model can understand.

    • Used in tasks like text classification, question answering, translation, summarization, etc.

  • Pairing with Custom Models

    • If a transformer model is trained with this tokenizer, the same tokenizer must be used at inference time to get consistent results.
  • Fast Tokenization at Scale

    • This tokenizer can be used for fast, efficient preprocessing in production NLP systems; a batch-encoding sketch follows this list.
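
As a concrete illustration of the preprocessing use case above, the sketch below batch-encodes a few sentences into input IDs. The sentences are invented for illustration, and padding/truncation options are omitted because they depend on which special tokens are configured in the repository.

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

# Made-up example sentences; any list of raw strings works the same way.
texts = [
    "Byte Pair Encoding merges frequent symbol pairs into subword units.",
    "Tokenizers turn raw text into the input IDs a model consumes.",
]

# Batch-encode: returns a BatchEncoding with one list of token IDs per sentence.
batch = tokenizer(texts)
for text, ids in zip(texts, batch["input_ids"]):
    print(len(ids), "tokens for:", text)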

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of this tokenizer. Because it was trained on English Wikipedia text (wikitext), it may split out-of-domain, non-English, or code-heavy text into noticeably more tokens. More information is needed for further recommendations.

Training Details

Training Data

The tokenizer was trained on the Salesforce/wikitext dataset: https://huggingface.co/datasets/Salesforce/wikitext
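
The exact training configuration is not recorded in this card. Purely as a sketch, the following shows how a BPE tokenizer with a 30,000-token vocabulary could be trained on wikitext with the 🤗 tokenizers library; the dataset subset (wikitext-103-raw-v1), whitespace pre-tokenizer, and special tokens below are assumptions, not the settings actually used.

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Assumed subset; the card does not state which wikitext configuration was used.
ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")

# Plain BPE model with a whitespace pre-tokenizer (assumed configuration).
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])

def text_batches(batch_size=1000):
    # Stream the corpus in chunks so the whole dataset never sits in one list.
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]["text"]

tok.train_from_iterator(text_batches(), trainer=trainer)

# Wrap the trained tokenizer so it can be reloaded via PreTrainedTokenizerFast.
fast = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]", pad_token="[PAD]")
fast.save_pretrained("custom-bpe-text-tokenizer")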

Evaluation

Results

  • Vocabulary size: 30,000
  • Tokenization is consistent: True
  • Average tokens per sentence (validation): 99.25
  • Average tokens per sentence (test): 96.24
  • Compression ratio (validation): 4.55 chars/token
  • Compression ratio (test): 4.47 chars/token
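
The figures above could in principle be reproduced along the following lines. This is only a sketch: the wikitext subset, the split, the filtering of empty lines, and whether raw lines or sentences were counted are all assumptions, so its output may not match the reported numbers exactly.

from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

# Assumed subset and split; empty lines are dropped before measuring.
val = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="validation")
lines = [t for t in val["text"] if t.strip()]

token_counts = [len(tokenizer.encode(t)) for t in lines]
char_counts = [len(t) for t in lines]

avg_tokens = sum(token_counts) / len(token_counts)
chars_per_token = sum(char_counts) / sum(token_counts)  # compression ratio
print(f"Average tokens per line: {avg_tokens:.2f}")
print(f"Compression ratio: {chars_per_token:.2f} chars/token")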