# Model Card for custom-bpe-text-tokenizer

## Model Details

### Model Description

This is the model card of a 🤗 transformers tokenizer that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: Sameer Paymode
- Model type: BPE-Tokenizer
- Language(s) (NLP): English
- License: No License Needed
## Get Started with the Model

This is a Byte Pair Encoding (BPE) tokenizer trained on the Salesforce/wikitext dataset.

Use the following code to load and try out the tokenizer:
```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

# Encode a sentence into token IDs, then decode the IDs back into text
text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print("Token IDs:", ids)
print("Decoded Text:", decoded)
```
## Uses

### Preprocessing Text for Language Models

The tokenizer transforms raw text into the input IDs (token IDs) that a model can understand, and can be used in tasks such as text classification, question answering, translation, and summarization.
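For example (a minimal sketch; the sample sentences are arbitrary), the tokenizer maps raw strings directly to `input_ids`:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

# Raw text in, model-ready token IDs out
encoded = tokenizer(
    ["Paris is the capital of France.", "BPE splits rare words into subword units."]
)
print(encoded["input_ids"])       # one list of token IDs per sentence
print(encoded["attention_mask"])  # 1 for every real token (no padding requested here)
```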
### Pairing with Custom Models

- If a transformer model is trained with this tokenizer, the same tokenizer must be used at inference time to get consistent results, as sketched below.
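A minimal sketch of that requirement (only the tokenizer's own repository is loaded here; no model checkpoint is involved):

```python
from transformers import PreTrainedTokenizerFast

# Load the *same* tokenizer for training-time preprocessing and for inference;
# token IDs are only meaningful under the vocabulary that produced them.
repo_id = "sameerpaymode/custom-bpe-text-tokenizer"
train_tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id)
infer_tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id)

text = "Subword vocabularies must match between training and inference."
assert train_tokenizer.encode(text) == infer_tokenizer.encode(text)
```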
### Fast Tokenization at Scale

- The tokenizer can be used for fast, efficient preprocessing in production NLP systems, for example in batched dataset preprocessing as sketched below.
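A sketch of batched preprocessing with 🤗 `datasets` (the `wikitext-2-raw-v1` configuration, the `text` column name, and the 512-token truncation length are assumptions):

```python
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")

# batched=True hands whole lists of strings to the Rust-backed fast tokenizer,
# which processes them far more efficiently than one string at a time.
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
print(tokenized)
```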
## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
## Training Details

### Training Data

The tokenizer was trained on the [Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset.
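For reference, a minimal sketch of how a 30,000-token BPE vocabulary can be trained on wikitext with the 🤗 `tokenizers` library; the exact dataset configuration, pre-tokenizer, and special tokens used for this tokenizer are not documented here, so those shown below are assumptions:

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

dataset = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")

# BPE model with whitespace pre-tokenization (assumed configuration)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[PAD]"],  # assumed special tokens
)

def text_iterator(batch_size=1000):
    # Stream the raw text column in chunks to avoid loading everything at once
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```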
## Evaluation

### Results

- Vocabulary size: 30,000
- Tokenization is consistent: True
- Average tokens per sentence (validation): 99.25
- Average tokens per sentence (test): 96.24
- Compression ratio (validation): 4.55 chars/token
- Compression ratio (test): 4.47 chars/token
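The per-split statistics above can be reproduced along these lines (a sketch; the `wikitext-103-raw-v1` configuration and the skipping of empty lines are assumptions):

```python
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("sameerpaymode/custom-bpe-text-tokenizer")

def split_stats(split):
    data = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split=split)
    lines = [t for t in data["text"] if t.strip()]  # skip empty lines
    token_counts = [len(tokenizer.encode(t)) for t in lines]
    avg_tokens = sum(token_counts) / len(token_counts)
    chars_per_token = sum(len(t) for t in lines) / sum(token_counts)
    return avg_tokens, chars_per_token

for split in ("validation", "test"):
    avg_tokens, ratio = split_stats(split)
    print(f"{split}: avg tokens/sentence = {avg_tokens:.2f}, compression = {ratio:.2f} chars/token")
```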