|
--- |
|
license: apache-2.0

pipeline_tag: fill-mask
|
datasets: |
|
- bigcode/starcoderdata |
|
- bigcode/starcoder2data-extras |
|
language: |
|
- en |
|
tags: |
|
- code |
|
- python |
|
- java |
|
- javascript |
|
- typescript |
|
- go |
|
- rust |
|
- php |
|
- ruby |
|
- cpp |
|
- c |
|
- sql |
|
--- |
|
# CodeModernBERT-Crow-v1-Pre |
|
|
|
## Model Description |
|
|
|
**CodeModernBERT-Crow-v1-Pre** is a pretrained language model based on the ModernBERT architecture, adapted specifically to source code and docstring-style natural language.

It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.
|
|
|
* **License**: Apache-2.0 |
|
* **Supported Languages**: Python, JavaScript, TypeScript, Java, Go, Rust, PHP, Ruby, C++, C, SQL |
|
* **Datasets**: |
|
|
|
* [bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) |
|
* [bigcode/starcoder2data-extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras) |
|
* **Pipeline tag**: `fill-mask` |
|
|
|
This model is a **pretraining checkpoint**, designed for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization. |
|
|
|
--- |
|
|
|
## Training Objective |
|
|
|
The model was pretrained on large-scale multilingual code corpora with the following goals: |
|
|
|
* Learn robust code representations across multiple programming languages. |
|
* Capture semantic relations between code tokens and natural language descriptions. |
|
* Provide a strong initialization point for fine-tuning on code-related downstream tasks. |
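
Because the checkpoint ships with a `fill-mask` pipeline tag and a masked-language-modeling head, the objective can be exercised directly. The sketch below is illustrative only: the 15% masking probability is an assumption of this example, not a documented training setting.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Assumed masking probability; the value used during pretraining is not published here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(["def add(a, b): return a + b"], truncation=True, max_length=1024)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])

with torch.no_grad():
    outputs = model(**batch)
print(outputs.loss)  # cross-entropy over the randomly masked positions
```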
|
|
|
--- |
|
|
|
## Tokenizer |
|
|
|
A custom **BPE tokenizer** was trained for code and docstrings. |
|
|
|
* **Vocabulary size**: 50,368 |
|
* **Special tokens**: Standard Hugging Face special tokens + custom tokens for code/document structure. |
|
* **Training process**: |
|
|
|
* Up to 1M examples per dataset. |
|
* Each example truncated to 10,000 characters. |
|
* Trained with files from multiple datasets (see above). |
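
For illustration, the sketch below shows how a byte-level BPE tokenizer with this vocabulary size could be trained using the `tokenizers` library. The special-token list, pre-tokenizer choice, and in-line corpus are hypothetical stand-ins; the exact configuration used for this checkpoint is not published in this card.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Hypothetical special tokens; the card only states "standard Hugging Face
# special tokens + custom tokens" without listing them.
special_tokens = ["[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=50_368, special_tokens=special_tokens)

# Stand-in corpus; in practice, up to 1M examples per dataset are streamed and
# each example is truncated to 10,000 characters before training.
corpus = (text[:10_000] for text in [
    "def add(a, b):\n    return a + b",
    '"""Add two numbers and return the sum."""',
])

tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("crow-bpe-tokenizer.json")
```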
|
|
|
--- |
|
|
|
## Architecture |
|
|
|
* **Base**: ModernBERT |
|
* **Hidden size**: 768 |
|
* **Number of layers**: 12 |
|
* **Attention heads**: 12 |
|
* **Intermediate size**: 3072 |
|
* **Max sequence length**: 8192 (inputs were limited to 1,024 tokens during training)
|
* **RoPE positional encoding**: supported |
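
These values can be checked against the released configuration; the attribute names below are the standard Hugging Face config fields used by ModernBERT-style models:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 8192
```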
|
|
|
--- |
|
|
|
## Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre") |
|
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre") |
|
|
|
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt") |
|
outputs = model(**inputs) |
|
``` |
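
Because the pipeline tag is `fill-mask`, the checkpoint can also be queried through the `fill-mask` pipeline. A minimal sketch (the mask token is read from the tokenizer rather than hardcoded):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Shuu12121/CodeModernBERT-Crow-v1-Pre")
masked = f"def add(a, b): return a {fill.tokenizer.mask_token} b"
for prediction in fill(masked, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 4))
```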
|
|
|
The model can be fine-tuned for: |
|
|
|
* Code search (query ↔ code retrieval; see the embedding sketch after this list)
|
* Code clone detection |
|
* Code summarization (docstring prediction) |
|
* Bug detection and repair (framed as masked-language-modeling / cloze-style prediction)
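
For retrieval-style tasks such as code search, one common starting point is to derive fixed-size embeddings from the encoder before or during fine-tuning. The sketch below mean-pools the last hidden state over non-padding tokens; this pooling strategy is an illustrative assumption, not a method prescribed by this card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

def embed(texts):
    # Mean-pool the last hidden state over non-padding positions.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query_vec = embed(["add two numbers"])
code_vec = embed(["def add(a, b): return a + b"])
print(torch.cosine_similarity(query_vec, code_vec))
```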
|
|
|
--- |
|
|
|
## Limitations |
|
|
|
* The model is not optimized for direct code generation. |
|
* Pretraining alone does not guarantee that predicted or completed code is functionally correct or executable.
|
* Fine-tuning is recommended for specific downstream applications. |
|
|
|
--- |
|
|
|
## Intended Use |
|
|
|
* Research in software engineering and natural language processing for code. |
|
* Educational exploration of pretrained models for code tasks. |
|
* Baseline for continued pretraining or fine-tuning. |
|
|
|
--- |