CodeModernBERT-Crow-v1-Pre

Model Description

CodeModernBERT-Crow-v1-Pre is a pretrained language model based on the ModernBERT architecture, adapted specifically for source code and docstring-style natural language. It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.

This model is a pretraining checkpoint intended for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization.


Training Objective

The model was pretrained on large-scale multilingual code corpora with the following goals:

  • Learn robust code representations across multiple programming languages.
  • Capture semantic relations between code tokens and natural language descriptions.
  • Provide a strong initialization point for fine-tuning on code-related downstream tasks.
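For orientation, the snippet below sketches how a masked-language-modeling objective of this kind is typically set up with the standard Hugging Face data collator. It is a minimal sketch: the 15% masking ratio and 1024-token truncation are common defaults used here for illustration, not documented training settings for this model, and it assumes the tokenizer defines a mask token.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Standard MLM setup; the masking ratio is the usual default, not a confirmed value.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Tokenize one code example and let the collator mask tokens and build MLM labels.
example = tokenizer("def add(a, b): return a + b", truncation=True, max_length=1024)
batch = collator([example])
print(batch["input_ids"].shape, batch["labels"].shape)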

Tokenizer

A custom byte-pair encoding (BPE) tokenizer was trained on code and docstrings.

  • Vocabulary size: 50,368

  • Special tokens: Standard Hugging Face special tokens + custom tokens for code/document structure.

  • Training process:

    • Up to 1M examples per dataset.
    • Each example truncated to 10,000 characters.
    • Trained on files drawn from multiple open-source code datasets.
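
As a quick check, the tokenizer can be loaded on its own and applied to a short function (a minimal sketch; the exact tokens printed depend on the learned merges).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(len(tokenizer))  # should be close to the 50,368-entry vocabulary
print(tokenizer.tokenize("def add(a, b):\n    return a + b"))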

Architecture

  • Base: ModernBERT
  • Hidden size: 768
  • Number of layers: 12
  • Attention heads: 12
  • Intermediate size: 3072
  • Max sequence length: 8192 (during training, inputs were limited to 1024 tokens)
  • Positional encoding: RoPE (rotary position embeddings)
  • Parameters: approximately 153M (F32 weights)
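
These settings can be verified against the published configuration; a minimal sketch (attribute names follow the standard ModernBERT config in transformers):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.intermediate_size, config.max_position_embeddings)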

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained checkpoint together with its masked-language-modeling head.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Encode a code snippet and run a forward pass to obtain token-level logits.
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model(**inputs)
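
Because the checkpoint carries a masked-language-modeling head, a simple sanity check is to mask one token in a snippet and inspect the top predictions. This is a sketch reusing the tokenizer and model loaded above, assuming the tokenizer defines a mask token.

import torch

# Replace one token with the mask token and predict it.
masked = f"def add(a, b): {tokenizer.mask_token} a + b"
inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidate tokens at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))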

The model can be fine-tuned for:

  • Code search (query ↔ code retrieval)
  • Code clone detection
  • Code summarization (docstring prediction)
  • Bug detection and repair (masked language modeling or cloze-style)
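
For the code-search case, for instance, fine-tuning usually starts from pooled encoder embeddings. The sketch below uses first-token pooling and cosine similarity as one common, illustrative choice, not a documented recipe for this model.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")  # encoder without the MLM head

def embed(text: str) -> torch.Tensor:
    # Encode and pool the first token's hidden state as a sentence-level embedding.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden[:, 0]

query = embed("add two numbers")
code = embed("def add(a, b): return a + b")
print(float(torch.cosine_similarity(query, code)))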

Limitations

  • The model is not optimized for direct code generation.
  • Pretraining does not guarantee correctness of code execution.
  • Fine-tuning is recommended for specific downstream applications.

Intended Use

  • Research in software engineering and natural language processing for code.
  • Educational exploration of pretrained models for code tasks.
  • Baseline for continued pretraining or fine-tuning.
