# Model Card for custom-bpe-text-tokenizer

## Model Details

### Model Description

This is the model card of a 🤗 transformers tokenizer that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: Sameer Paymode
- Model type: BPE-Tokenizer
- Language(s) (NLP): English
- License: No License Needed
## Get Started with the Model

This is a Byte Pair Encoding (BPE) tokenizer trained on the Salesforce/wikitext dataset.

Use the following code to load and try out the tokenizer:
```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

# Encode a sentence into token IDs, then decode the IDs back into text
text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print("Token IDs:", ids)
print("Decoded Text:", decoded)
```
## Uses

### Preprocessing Text for Language Models

The tokenizer transforms raw text into the input IDs (token IDs) that a model can understand, and can be used in tasks such as text classification, question answering, translation, and summarization.
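For example (a minimal sketch; the sample sentences are arbitrary), the tokenizer maps raw strings directly to `input_ids`:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

# Raw text in, model-ready token IDs out
encoded = tokenizer(
    ["Paris is the capital of France.", "BPE splits rare words into subword units."]
)
print(encoded["input_ids"])       # one list of token IDs per sentence
print(encoded["attention_mask"])  # 1 for every real token (no padding requested here)
```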
### Pairing with Custom Models

- If a transformer model is trained with this tokenizer, the same tokenizer must be used at inference time to get consistent results, as sketched below.
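A minimal sketch of that requirement (only the tokenizer's own repository is loaded here; no model checkpoint is involved):

```python
from transformers import PreTrainedTokenizerFast

# Load the *same* tokenizer for training-time preprocessing and for inference;
# token IDs are only meaningful under the vocabulary that produced them.
repo_id = "sameerpaymode/custom-bpe-text-tokenizer"
train_tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id)
infer_tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id)

text = "Subword vocabularies must match between training and inference."
assert train_tokenizer.encode(text) == infer_tokenizer.encode(text)
```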
### Fast Tokenization at Scale

- The tokenizer can be used for fast, efficient preprocessing in production NLP systems, for example in batched dataset preprocessing as sketched below.
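A sketch of batched preprocessing with 🤗 `datasets` (the `wikitext-2-raw-v1` configuration, the `text` column name, and the 512-token truncation length are assumptions):

```python
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")

# batched=True hands whole lists of strings to the Rust-backed fast tokenizer,
# which processes them far more efficiently than one string at a time.
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
print(tokenized)
```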
## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
## Training Details

### Training Data

The tokenizer was trained on the [Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset.
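For reference, a minimal sketch of how a 30,000-token BPE vocabulary can be trained on wikitext with the 🤗 `tokenizers` library; the exact dataset configuration, pre-tokenizer, and special tokens used for this tokenizer are not documented here, so those shown below are assumptions:

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

dataset = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")

# BPE model with whitespace pre-tokenization (assumed configuration)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[PAD]"],  # assumed special tokens
)

def text_iterator(batch_size=1000):
    # Stream the raw text column in chunks to avoid loading everything at once
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```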
## Evaluation

### Results

- Vocabulary size: 30,000
- Tokenization is consistent: True
- Average tokens per sentence (validation): 99.25
- Average tokens per sentence (test): 96.24
- Compression ratio (validation): 4.55 chars/token
- Compression ratio (test): 4.47 chars/token
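The per-split statistics above can be reproduced along these lines (a sketch; the `wikitext-103-raw-v1` configuration and the skipping of empty lines are assumptions):

```python
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

def split_stats(split):
    data = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split=split)
    lines = [t for t in data["text"] if t.strip()]  # skip empty lines
    token_counts = [len(tokenizer.encode(t)) for t in lines]
    avg_tokens = sum(token_counts) / len(token_counts)
    chars_per_token = sum(len(t) for t in lines) / sum(token_counts)
    return avg_tokens, chars_per_token

for split in ("validation", "test"):
    avg_tokens, ratio = split_stats(split)
    print(f"{split}: avg tokens/sentence = {avg_tokens:.2f}, compression = {ratio:.2f} chars/token")
```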