CodeModernBERT-Crow-v1.1 🐦‍⬛

Model Details

  • Model type: Bi-encoder architecture based on ModernBERT
  • Parameters: ~153M (F32)
  • Architecture:
    • Hidden size: 768
    • Layers: 12
    • Attention heads: 12
    • Intermediate size: 3,072
    • Max position embeddings: 8,192
    • Local attention window size: 128
    • Global RoPE positional encoding: θ = 160,000
    • Local RoPE positional encoding: θ = 10,000
  • Sequence length: up to 2,048 tokens for code and docstring inputs during pretraining
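
The snippet below is a minimal usage sketch showing how the encoder can embed a code snippet and its docstring with transformers. The CLS-token pooling and cosine-similarity scoring are illustrative assumptions, not something prescribed by this card.

```python
# Minimal sketch: embedding a code snippet and a docstring with the bi-encoder.
# CLS-token pooling and cosine similarity are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "Shuu12121/CodeModernBERT-Crow-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def embed(text: str) -> torch.Tensor:
    # Truncate to the 2,048-token length used for code/docstrings in pretraining.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden[:, 0]  # CLS-token embedding (assumed pooling strategy)

code = "def add(a, b):\n    return a + b"
doc = "Add two numbers and return the sum."
score = torch.nn.functional.cosine_similarity(embed(code), embed(doc))
print(float(score))
```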

Pretraining

  • Tokenizer: Custom BPE tokenizer trained on code and docstring pairs.

  • Data: Functions and natural language descriptions extracted from GitHub repositories.

  • Masking strategy: Two-phase pretraining.

    • Phase 1: Random Masked Language Modeling (MLM)
      30% of tokens in code functions are randomly masked and predicted using standard MLM.
    • Phase 2: Line-level Span Masking
      Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity (see the masking sketch after this list):
      1. Convert input tokens back to strings.
      2. Detect newline tokens with regex and segment inputs by line.
      3. Exclude whitespace-only tokens from masking.
      4. Apply padding to align sequence lengths.
      5. Randomly mask 30% of tokens in each line segment and predict them.
  • Pretraining hyperparameters (see the TrainingArguments sketch after this list):

    • Batch size: 16
    • Gradient accumulation steps: 16
    • Effective batch size: 256
    • Optimizer: AdamW
    • Learning rate: 5e-5
    • Scheduler: Cosine
    • Epochs: 3
    • Precision: Mixed precision (fp16) using transformers
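
The following is a rough sketch of the Phase 2 line-level span masking procedure. The newline symbol ("Ċ"), the whitespace filter, and the label convention are illustrative assumptions; the actual pretraining code may differ.

```python
# Rough sketch of the line-level span masking described in Phase 2.
import random
import re

def line_level_mask(token_ids, tokenizer, mask_prob=0.30):
    # Step 1: convert ids back to token strings.
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position not predicted by the loss

    # Step 2: detect newline tokens and segment token positions by line.
    # "Ċ" is the byte-level BPE newline symbol (an assumption about the vocabulary).
    lines, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if re.search(r"\n|Ċ", tok):
            lines.append(current)
            current = []
    if current:
        lines.append(current)

    # Steps 3 and 5: within each line, mask 30% of non-whitespace tokens.
    for line in lines:
        candidates = [i for i in line if tokens[i].replace("Ġ", "").strip()]
        k = int(len(candidates) * mask_prob)
        for i in random.sample(candidates, k):
            labels[i] = masked[i]
            masked[i] = tokenizer.mask_token_id

    # Step 4 (padding to a common length) is left to the data collator here.
    return masked, labels
```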
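
The hyperparameters above map roughly onto transformers.TrainingArguments as sketched below. The output_dir is a placeholder, and settings not listed in this card (warmup, weight decay, etc.) are left at their defaults.

```python
# Sketch: the listed pretraining hyperparameters expressed as TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="crow-pretraining",      # placeholder path
    per_device_train_batch_size=16,     # batch size 16
    gradient_accumulation_steps=16,     # 16 x 16 = 256 effective batch size
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    optim="adamw_torch",                # AdamW
    fp16=True,                          # mixed precision
)
```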