CodeModernBERT-Crow-v1-Pre
Model Description
CodeModernBERT-Crow-v1-Pre is a pretrained language model based on the ModernBERT architecture, adapted specifically for source code and docstring-style natural language. It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.
License: Apache-2.0
Supported Languages: Python, JavaScript, TypeScript, Java, Go, Rust, PHP, Ruby, C++, C, SQL
Datasets:
Pipeline tag: fill-mask
This model is a pretraining checkpoint, designed for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization.
Training Objective
The model was pretrained on large-scale multilingual code corpora with the following goals (a masking sketch follows the list):
- Learn robust code representations across multiple programming languages.
- Capture semantic relations between code tokens and natural language descriptions.
- Provide a strong initialization point for fine-tuning on code-related downstream tasks.
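As a minimal sketch of how a masked-language-modeling objective can be set up with standard Hugging Face tooling (the 15% masking rate and the example snippets are assumptions for illustration, not documented training settings):

from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Tokenize a few illustrative snippets (not the actual training corpus).
examples = ["def add(a, b): return a + b", "const sub = (a, b) => a - b;"]
encodings = [tokenizer(text, truncation=True, max_length=1024) for text in examples]

# The collator randomly replaces tokens with the mask token; 15% is the
# conventional MLM rate and is assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(encodings)

outputs = model(**batch)
print(outputs.loss)  # masked-language-modeling loss on the toy batch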
Tokenizer
A custom BPE tokenizer was trained for code and docstrings.
Vocabulary size: 50,368
Special tokens: Standard Hugging Face special tokens + custom tokens for code/document structure.
Training process:
- Up to 1M examples per dataset.
- Each example truncated to 10,000 characters.
- Trained on files drawn from multiple datasets (see the Datasets entry above); a quick tokenizer inspection sketch follows this list.
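A minimal way to load and inspect the released tokenizer (the printed values depend on the published tokenizer files; the sample snippet is illustrative):

from transformers import AutoTokenizer

# Load the custom BPE tokenizer and inspect its vocabulary and special tokens.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(len(tokenizer))                # vocabulary size, expected to be 50,368
print(tokenizer.all_special_tokens)  # standard tokens plus custom structural tokens
print(tokenizer.tokenize("def add(a, b): return a + b"))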
Architecture
- Base: ModernBERT
- Hidden size: 768
- Number of layers: 12
- Attention heads: 12
- Intermediate size: 3072
- Max sequence length: 8192 (training inputs were capped at 1024 tokens; see the config sketch after this list)
- RoPE positional encoding: supported
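The reported sizes can be verified from the published config; a minimal sketch (the attribute names follow the standard ModernBERT config and are assumed to match the released files):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.intermediate_size, config.max_position_embeddings)

# Since training inputs were capped at 1024 tokens, truncating to 1024 keeps
# inference inputs within the length regime seen during pretraining.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
inputs = tokenizer("def add(a, b): return a + b", truncation=True, max_length=1024, return_tensors="pt")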
Usage
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained tokenizer and the masked-language-modeling head.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Encode a code snippet and run a forward pass.
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model(**inputs)
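Continuing from the snippet above, a small fill-mask example (the snippet and top-k value are illustrative; actual predictions depend on the checkpoint):

import torch

# Predict the token hidden behind the mask in a code snippet.
text = f"def add(a, b): {tokenizer.mask_token} a + b"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))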
The model can be fine-tuned for:
- Code search (query ↔ code retrieval); see the embedding sketch after this list
- Code clone detection
- Code summarization (docstring prediction)
- Bug detection and repair (masked language modeling or cloze-style)
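For retrieval-style tasks such as code search, one common recipe is to compare pooled sentence embeddings and fine-tune the encoder with a contrastive objective. Below is a minimal sketch using mean pooling over the base encoder; the pooling strategy and the example query/code pairs are assumptions, not documented parts of this checkpoint, and similarity scores are only meaningful after fine-tuning:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["add two numbers"])
candidates = embed(["def add(a, b): return a + b", "def sub(a, b): return a - b"])
print(F.cosine_similarity(query, candidates))  # higher score indicates a closer match after fine-tuning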
Limitations
- The model is encoder-only and is not suited to direct code generation.
- Pretraining does not guarantee correctness of code execution.
- Fine-tuning is recommended for specific downstream applications.
Intended Use
- Research in software engineering and natural language processing for code.
- Educational exploration of pretrained models for code tasks.
- Baseline for continued pretraining or fine-tuning.