---
license: apache-2.0
datasets:
- Shuu12121/python-treesitter-filtered-datasetsV2
- Shuu12121/javascript-treesitter-filtered-datasetsV2
- Shuu12121/ruby-treesitter-filtered-datasetsV2
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/rust-treesitter-filtered-datasetsV2
- Shuu12121/php-treesitter-filtered-datasetsV2
- Shuu12121/typescript-treesitter-filtered-datasetsV2
pipeline_tag: fill-mask
tags:
- code
- python
- java
- javascript
- typescript
- go
- ruby
- rust
- php
language:
- en
base_model:
- Shuu12121/CodeModernBERT-Crow-v1-Pre
---

# CodeModernBERT-Crow-v1.1🐦⬛

## Model Details
- Model type: Bi-encoder architecture based on ModernBERT
- Architecture (see the configuration sketch below):
  - Hidden size: 768
  - Layers: 12
  - Attention heads: 12
  - Intermediate size: 3,072
  - Max position embeddings: 8,192
  - Local attention window size: 128
  - Global RoPE positional encoding: θ = 160,000
  - Local RoPE positional encoding: θ = 10,000
- Sequence length: up to 2,048 tokens for code and docstring inputs during pretraining
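
The architecture values above map directly onto a ModernBERT configuration. The following is a minimal sketch, assuming the `ModernBertConfig` class from recent versions of `transformers`; any value not listed above (e.g. vocabulary size) is left at its library default and may differ from the actual checkpoint:

```python
from transformers import ModernBertConfig, ModernBertForMaskedLM

# Sketch of the architecture listed above; unspecified fields (e.g. vocab_size)
# keep the library defaults and are not confirmed values for this model.
config = ModernBertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=8192,
    local_attention=128,          # local attention window size
    global_rope_theta=160_000.0,  # global RoPE θ
    local_rope_theta=10_000.0,    # local RoPE θ
)

model = ModernBertForMaskedLM(config)
```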
## Pretraining
Tokenizer: Custom BPE tokenizer trained on code and docstring pairs.
Data: Functions and natural language descriptions extracted from GitHub repositories.
Masking strategy: Two-phase pretraining (code sketches of both phases follow the list below).
- Phase 1: Random Masked Language Modeling (MLM)
  30% of tokens in code functions are randomly masked and predicted using standard MLM.
- Phase 2: Line-level Span Masking
  Inspired by SpanBERT, continued pretraining on the same data with span masking at line granularity:
  - Convert input tokens back to strings.
  - Detect newline tokens with regex and segment inputs by line.
  - Exclude whitespace-only tokens from masking.
  - Apply padding to align sequence lengths.
  - Randomly mask 30% of tokens in each line segment and predict them.
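
Phase 1 is standard random MLM with a 30% masking ratio. A minimal sketch of an equivalent data collator, assuming the `DataCollatorForLanguageModeling` class from `transformers` (the checkpoint name is illustrative):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative checkpoint name; the custom BPE tokenizer described above is assumed.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1.1")

# Random MLM with a 30% masking ratio, as in Phase 1.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,
)
```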
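Phase 2's line-level procedure can be sketched as follows. The function name, the per-example interface, and the use of `-100` as the ignored-label value are assumptions for illustration; padding and batching are omitted:

```python
import random
import re

import torch


def line_level_span_mask(input_ids, tokenizer, mask_ratio=0.3):
    """Sketch of the Phase 2 line-level span masking described above."""
    # 1. Convert input token ids back to their string forms.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    token_strs = [tokenizer.convert_tokens_to_string([t]) for t in tokens]

    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # positions set to -100 are ignored by the MLM loss

    # 2. Detect newline tokens with a regex and segment token positions by line.
    lines, current = [], []
    for idx, s in enumerate(token_strs):
        current.append(idx)
        if re.search(r"\n", s):
            lines.append(current)
            current = []
    if current:
        lines.append(current)

    # 3. Mask 30% of tokens per line, excluding whitespace-only and special tokens.
    for line in lines:
        candidates = [
            i for i in line
            if token_strs[i].strip() and input_ids[i] not in tokenizer.all_special_ids
        ]
        if not candidates:
            continue
        n_mask = max(1, round(len(candidates) * mask_ratio))
        for i in random.sample(candidates, n_mask):
            labels[i] = input_ids[i]
            masked[i] = tokenizer.mask_token_id

    # 4. Padding to align sequence lengths would happen when batching (not shown here).
    return torch.tensor(masked), torch.tensor(labels)
```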
Pretraining hyperparameters:
- Batch size: 16
- Gradient accumulation steps: 16
- Effective batch size: 256
- Optimizer: AdamW
- Learning rate: 5e-5
- Scheduler: Cosine
- Epochs: 3
- Precision: Mixed precision (fp16) using `transformers`
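
Expressed as `transformers.TrainingArguments`, the hyperparameters above would look roughly as follows. This is a sketch: the output directory, the optimizer variant string, and the single-device assumption behind the 256 effective batch size are illustrative, not confirmed details:

```python
from transformers import TrainingArguments

# Sketch of the pretraining hyperparameters listed above.
# 16 (per-device batch) x 16 (accumulation steps) = 256 effective batch size on one device.
training_args = TrainingArguments(
    output_dir="codemodernbert-crow-pretrain",  # illustrative
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    fp16=True,            # mixed-precision training
    optim="adamw_torch",  # AdamW
)
```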