---
license: apache-2.0
datasets:
- bigcode/starcoderdata
- bigcode/starcoder2data-extras
language:
- en
tags:
- code
- python
- java
- javascript
- typescript
- go
- rust
- php
- ruby
- cpp
- c
- sql
---
# CodeModernBERT-Crow-v1-Pre
## Model Description
**CodeModernBERT-Crow-v1-Pre** is a pretrained language model based on the ModernBERT architecture, adapted specifically to source code and docstring-style natural language.
It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.
* **License**: Apache-2.0
* **Supported Languages**: Python, JavaScript, TypeScript, Java, Go, Rust, PHP, Ruby, C++, C, SQL
* **Datasets**:
* [bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)
* [bigcode/starcoder2data-extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras)
* **Pipeline tag**: `fill-mask`
This model is a **pretraining checkpoint**, designed for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization.
---
## Training Objective
The model was pretrained on large-scale multilingual code corpora with the following goals:
* Learn robust code representations across multiple programming languages.
* Capture semantic relations between code tokens and natural language descriptions.
* Provide a strong initialization point for fine-tuning on code-related downstream tasks.
---
## Tokenizer
A custom **BPE tokenizer** was trained for code and docstrings.
* **Vocabulary size**: 50,368
* **Special tokens**: Standard Hugging Face special tokens + custom tokens for code/document structure.
* **Training process**:
* Up to 1M examples per dataset.
* Each example truncated to 10,000 characters.
* Trained with files from multiple datasets (see above).
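A quick way to sanity-check the tokenizer locally (a minimal sketch; it only assumes the standard Hugging Face `AutoTokenizer` API):
```python
from transformers import AutoTokenizer

# Load the custom BPE tokenizer shipped with this checkpoint
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# len() counts the full vocabulary, including added special tokens
print(len(tokenizer))                 # expected to be around 50,368
print(tokenizer.special_tokens_map)   # standard special tokens plus custom structural tokens

# Inspect how a small code snippet is segmented by the BPE model
print(tokenizer.tokenize("def add(a, b): return a + b"))
```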
---
## Architecture
* **Base**: ModernBERT
* **Hidden size**: 768
* **Number of layers**: 12
* **Attention heads**: 12
* **Intermediate size**: 3072
* **Max sequence length**: 8192 (inputs were limited to 1,024 tokens during training)
* **RoPE positional encoding**: supported
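These values can be read back from the published configuration (a minimal check, assuming the standard Hugging Face config field names used by ModernBERT):
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Each field should match the list above
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 8192
```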
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained checkpoint and its custom BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Encode a code snippet and run a forward pass (logits over the vocabulary)
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model(**inputs)
```
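Because the pipeline tag is `fill-mask`, the pretraining head can also be exercised directly. The snippet below is a minimal sketch: it masks one token of a small function and prints the top candidate tokens (the exact predictions are not guaranteed).
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Replace one token of the snippet with the tokenizer's mask token
text = f"def add(a, b): {tokenizer.mask_token} a + b"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and show the top-5 candidate tokens for it
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```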
The model can be fine-tuned for:
* Code search (query ↔ code retrieval)
* Code clone detection
* Code summarization (docstring prediction)
* Bug detection and repair (masked language modeling or cloze-style)
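For example, a common starting point for code search (a sketch only, not a method prescribed by this card; mean pooling over the last hidden states is an assumption) is to embed queries and code with the base encoder and compare them by cosine similarity:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states over non-padding tokens
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query_emb = embed("add two numbers")
code_emb = embed("def add(a, b): return a + b")

# Without retrieval fine-tuning this score is only a rough signal
print(F.cosine_similarity(query_emb, code_emb).item())
```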
---
## Limitations
* The model is not optimized for direct code generation.
* Pretraining does not guarantee correctness of code execution.
* Fine-tuning is recommended for specific downstream applications.
---
## Intended Use
* Research in software engineering and natural language processing for code.
* Educational exploration of pretrained models for code tasks.
* Baseline for continued pretraining or fine-tuning.
---