CodeModernBERT-Crow-v1-Pre
Model Description
CodeModernBERT-Crow-v1-Pre is a pretrained language model based on the ModernBERT architecture, adapted specifically for source code and docstring-style natural language. It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.
License: Apache-2.0
Supported Languages: Python, JavaScript, TypeScript, Java, Go, Rust, PHP, Ruby, C++, C, SQL
Datasets:
Pipeline tag: fill-mask
This model is a pretraining checkpoint, designed for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization.
Training Objective
The model was pretrained on large-scale multilingual code corpora with the following goals (a masking sketch follows the list):
- Learn robust code representations across multiple programming languages.
- Capture semantic relations between code tokens and natural language descriptions.
- Provide a strong initialization point for fine-tuning on code-related downstream tasks.
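As a minimal sketch of how a masked-language-modeling objective can be set up with standard Hugging Face tooling (the 15% masking rate and the example snippets are assumptions for illustration, not documented training settings):

from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Tokenize a few illustrative snippets (not the actual training corpus).
examples = ["def add(a, b): return a + b", "const sub = (a, b) => a - b;"]
encodings = [tokenizer(text, truncation=True, max_length=1024) for text in examples]

# The collator randomly replaces tokens with the mask token; 15% is the
# conventional MLM rate and is assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(encodings)

outputs = model(**batch)
print(outputs.loss)  # masked-language-modeling loss on the toy batch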
Tokenizer
A custom BPE tokenizer was trained for code and docstrings.
Vocabulary size: 50,368
Special tokens: Standard Hugging Face special tokens + custom tokens for code/document structure.
Training process:
- Up to 1M examples per dataset.
- Each example truncated to 10,000 characters.
- Trained on files drawn from multiple datasets (see the Datasets entry above); a quick tokenizer inspection sketch follows this list.
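A minimal way to load and inspect the released tokenizer (the printed values depend on the published tokenizer files; the sample snippet is illustrative):

from transformers import AutoTokenizer

# Load the custom BPE tokenizer and inspect its vocabulary and special tokens.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(len(tokenizer))                # vocabulary size, expected to be 50,368
print(tokenizer.all_special_tokens)  # standard tokens plus custom structural tokens
print(tokenizer.tokenize("def add(a, b): return a + b"))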
Architecture
- Base: ModernBERT
- Hidden size: 768
- Number of layers: 12
- Attention heads: 12
- Intermediate size: 3072
- Max sequence length: 8192 (training inputs were capped at 1024 tokens; see the config sketch after this list)
- RoPE positional encoding: supported
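The reported sizes can be verified from the published config; a minimal sketch (the attribute names follow the standard ModernBERT config and are assumed to match the released files):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.intermediate_size, config.max_position_embeddings)

# Since training inputs were capped at 1024 tokens, truncating to 1024 keeps
# inference inputs within the length regime seen during pretraining.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
inputs = tokenizer("def add(a, b): return a + b", truncation=True, max_length=1024, return_tensors="pt")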
Usage
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained tokenizer and the masked-language-modeling head.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Encode a code snippet and run a forward pass.
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model(**inputs)
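Continuing from the snippet above, a small fill-mask example (the snippet and top-k value are illustrative; actual predictions depend on the checkpoint):

import torch

# Predict the token hidden behind the mask in a code snippet.
text = f"def add(a, b): {tokenizer.mask_token} a + b"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))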
The model can be fine-tuned for:
- Code search (query ↔ code retrieval); see the embedding sketch after this list
- Code clone detection
- Code summarization (docstring prediction)
- Bug detection and repair (masked language modeling or cloze-style)
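For retrieval-style tasks such as code search, one common recipe is to compare pooled sentence embeddings and fine-tune the encoder with a contrastive objective. Below is a minimal sketch using mean pooling over the base encoder; the pooling strategy and the example query/code pairs are assumptions, not documented parts of this checkpoint, and similarity scores are only meaningful after fine-tuning:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["add two numbers"])
candidates = embed(["def add(a, b): return a + b", "def sub(a, b): return a - b"])
print(F.cosine_similarity(query, candidates))  # higher score indicates a closer match after fine-tuning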
Limitations
- The model is encoder-only and is not suited to direct code generation.
- Pretraining does not guarantee correctness of code execution.
- Fine-tuning is recommended for specific downstream applications.
Intended Use
- Research in software engineering and natural language processing for code.
- Educational exploration of pretrained models for code tasks.
- Baseline for continued pretraining or fine-tuning.