---
license: apache-2.0
datasets:
  - Shuu12121/python-treesitter-filtered-datasetsV2
  - Shuu12121/javascript-treesitter-filtered-datasetsV2
  - Shuu12121/ruby-treesitter-filtered-datasetsV2
  - Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
  - Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
  - Shuu12121/rust-treesitter-filtered-datasetsV2
  - Shuu12121/php-treesitter-filtered-datasetsV2
  - Shuu12121/typescript-treesitter-filtered-datasetsV2
pipeline_tag: fill-mask
tags:
  - code
  - python
  - java
  - javascript
  - typescript
  - go
  - ruby
  - rust
  - php
language:
  - en
base_model:
  - Shuu12121/CodeModernBERT-Crow-v1-Pre
---

# CodeModernBERT-Crow-v1.1 🐦‍⬛

## Model Details

- Model type: Bi-encoder architecture based on ModernBERT
- Architecture (see the configuration sketch after this list):
  - Hidden size: 768
  - Layers: 12
  - Attention heads: 12
  - Intermediate size: 3,072
  - Max position embeddings: 8,192
  - Local attention window size: 128
  - Global RoPE positional encoding: θ = 160,000
  - Local RoPE positional encoding: θ = 10,000
- Sequence length: up to 2,048 tokens for code and docstring inputs during pretraining
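
For reference, the dimensions above can be expressed with the `ModernBertConfig` class in 🤗 Transformers (v4.48 or later). The snippet below is a minimal sketch, assuming the checkpoint follows the standard `ModernBertConfig` / `ModernBertForMaskedLM` classes; the vocabulary size is a placeholder rather than the model's actual value.

```python
# Illustrative reconstruction of the listed architecture as a ModernBERT config.
# Requires a transformers release with ModernBERT support (>= 4.48).
from transformers import ModernBertConfig, ModernBertForMaskedLM

config = ModernBertConfig(
    vocab_size=50_000,            # placeholder: use the released tokenizer's vocabulary size
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=8192,
    local_attention=128,          # local attention window size
    global_rope_theta=160_000.0,  # global RoPE theta
    local_rope_theta=10_000.0,    # local RoPE theta
)

model = ModernBertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```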

## Pretraining

- Tokenizer: Custom BPE tokenizer trained on code and docstring pairs.

- Data: Functions and natural language descriptions extracted from GitHub repositories.

- Masking strategy: Two-phase pretraining (illustrative sketches of both phases follow this list).

  - Phase 1: Random Masked Language Modeling (MLM)
    30% of the tokens in each code function are randomly masked and predicted with the standard MLM objective.
  - Phase 2: Line-level Span Masking
    Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity:
    1. Convert the input tokens back to strings.
    2. Detect newline tokens with a regex and segment the input by line.
    3. Exclude whitespace-only tokens from masking.
    4. Apply padding to align sequence lengths.
    5. Randomly mask 30% of the tokens in each line segment and predict them.

- Pretraining hyperparameters (see the training-arguments sketch after this list):

  - Batch size: 16
  - Gradient accumulation steps: 16
  - Effective batch size: 256 (16 × 16)
  - Optimizer: AdamW
  - Learning rate: 5e-5
  - Scheduler: Cosine
  - Epochs: 3
  - Precision: Mixed precision (fp16) via 🤗 Transformers
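
As a rough illustration of Phase 1, the standard 🤗 Transformers masking collator reproduces a 30% random-masking objective. This is a sketch only: the tokenizer path is a placeholder, and it assumes the custom tokenizer defines a mask token.

```python
# Phase 1 sketch: random MLM with a 30% masking probability.
# Paths are placeholders; this is not the original training script.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("path/to/custom-bpe-tokenizer")  # placeholder

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # mask 30% of tokens, as described above
)

encoding = tokenizer(
    "def add(a, b):\n    return a + b",
    truncation=True,
    max_length=2048,       # pretraining sequence length
)
batch = collator([encoding])
print(batch["input_ids"].shape, batch["labels"].shape)
```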
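
Phase 2 can be sketched as a custom masking step that follows the five numbered steps above. The function below is a simplified reconstruction, not the original implementation: padding and batching (step 4) and special-token handling are omitted for brevity.

```python
# Phase 2 sketch: line-level masking reconstructed from the steps listed above.
import random
import re

import torch

NEWLINE_RE = re.compile(r"\n")
WHITESPACE_RE = re.compile(r"^\s*$")

def line_level_mask(input_ids, tokenizer, mask_ratio=0.30, seed=None):
    """Mask roughly `mask_ratio` of the non-whitespace tokens within each line segment."""
    rng = random.Random(seed)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)             # step 1: ids -> token strings
    texts = [tokenizer.convert_tokens_to_string([t]) for t in tokens]

    masked_ids = list(input_ids)
    labels = [-100] * len(input_ids)                                # -100 is ignored by the MLM loss

    segment = []                                                    # token positions in the current line
    for pos, text in enumerate(texts):
        if NEWLINE_RE.search(text):                                 # step 2: a newline token closes the line
            _mask_segment(segment, masked_ids, labels, input_ids, tokenizer, mask_ratio, rng)
            segment = []
            continue
        if WHITESPACE_RE.match(text):                               # step 3: skip whitespace-only tokens
            continue
        segment.append(pos)
    _mask_segment(segment, masked_ids, labels, input_ids, tokenizer, mask_ratio, rng)
    return torch.tensor(masked_ids), torch.tensor(labels)

def _mask_segment(segment, masked_ids, labels, input_ids, tokenizer, mask_ratio, rng):
    """Randomly mask a fraction of the token positions in one line segment (step 5)."""
    if not segment:
        return
    n_mask = max(1, round(len(segment) * mask_ratio))
    for pos in rng.sample(segment, n_mask):
        labels[pos] = input_ids[pos]
        masked_ids[pos] = tokenizer.mask_token_id

# Usage sketch:
# enc = tokenizer("def add(a, b):\n    return a + b", truncation=True, max_length=2048)
# ids, labels = line_level_mask(enc["input_ids"], tokenizer, mask_ratio=0.30)
```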
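
The listed hyperparameters map roughly onto the following 🤗 `TrainingArguments`; this is a sketch under the assumption of a standard `Trainer`-style loop, with the output directory as a placeholder.

```python
# Hyperparameter sketch mirroring the values listed above; not the released training script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./codemodernbert-crow-pretrain",  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,               # effective batch size: 16 x 16 = 256
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    fp16=True,                                    # mixed precision
    optim="adamw_torch",                          # AdamW
)
```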