---
license: apache-2.0
datasets:
  - Shuu12121/python-treesitter-filtered-datasetsV2
  - Shuu12121/javascript-treesitter-filtered-datasetsV2
  - Shuu12121/ruby-treesitter-filtered-datasetsV2
  - Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
  - Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
  - Shuu12121/rust-treesitter-filtered-datasetsV2
  - Shuu12121/php-treesitter-filtered-datasetsV2
  - Shuu12121/typescript-treesitter-filtered-datasetsV2
pipeline_tag: fill-mask
tags:
  - code
  - python
  - java
  - javascript
  - typescript
  - go
  - ruby
  - rust
  - php
language:
  - en
base_model:
  - Shuu12121/CodeModernBERT-Crow-v1-Pre
---

# CodeModernBERT-Crow-v1.1 🐦‍⬛

## Model Details

- Model type: Bi-encoder architecture based on ModernBERT
- Architecture (see the configuration sketch after this list):
  - Hidden size: 768
  - Layers: 12
  - Attention heads: 12
  - Intermediate size: 3,072
  - Max position embeddings: 8,192
  - Local attention window size: 128
  - Global RoPE positional encoding: θ = 160,000
  - Local RoPE positional encoding: θ = 10,000
- Sequence length: up to 2,048 tokens for code and docstring inputs during pretraining
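
For reference, the dimensions above can be expressed with the `ModernBertConfig` class in 🤗 Transformers (v4.48 or later). The snippet below is a minimal sketch, assuming the checkpoint follows the standard `ModernBertConfig` / `ModernBertForMaskedLM` classes; the vocabulary size is a placeholder rather than the model's actual value.

```python
# Illustrative reconstruction of the listed architecture as a ModernBERT config.
# Requires a transformers release with ModernBERT support (>= 4.48).
from transformers import ModernBertConfig, ModernBertForMaskedLM

config = ModernBertConfig(
    vocab_size=50_000,            # placeholder: use the released tokenizer's vocabulary size
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=8192,
    local_attention=128,          # local attention window size
    global_rope_theta=160_000.0,  # global RoPE theta
    local_rope_theta=10_000.0,    # local RoPE theta
)

model = ModernBertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```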

## Pretraining

- Tokenizer: Custom BPE tokenizer trained on code and docstring pairs.

- Data: Functions and natural language descriptions extracted from GitHub repositories.

- Masking strategy: Two-phase pretraining (illustrative sketches of both phases follow this list).

  - Phase 1: Random Masked Language Modeling (MLM)
    30% of the tokens in each code function are randomly masked and predicted with the standard MLM objective.
  - Phase 2: Line-level Span Masking
    Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity:
    1. Convert the input tokens back to strings.
    2. Detect newline tokens with a regex and segment the input by line.
    3. Exclude whitespace-only tokens from masking.
    4. Apply padding to align sequence lengths.
    5. Randomly mask 30% of the tokens in each line segment and predict them.

- Pretraining hyperparameters (see the training-arguments sketch after this list):

  - Batch size: 16
  - Gradient accumulation steps: 16
  - Effective batch size: 256 (16 × 16)
  - Optimizer: AdamW
  - Learning rate: 5e-5
  - Scheduler: Cosine
  - Epochs: 3
  - Precision: Mixed precision (fp16) via 🤗 Transformers
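
As a rough illustration of Phase 1, the standard 🤗 Transformers masking collator reproduces a 30% random-masking objective. This is a sketch only: the tokenizer path is a placeholder, and it assumes the custom tokenizer defines a mask token.

```python
# Phase 1 sketch: random MLM with a 30% masking probability.
# Paths are placeholders; this is not the original training script.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("path/to/custom-bpe-tokenizer")  # placeholder

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # mask 30% of tokens, as described above
)

encoding = tokenizer(
    "def add(a, b):\n    return a + b",
    truncation=True,
    max_length=2048,       # pretraining sequence length
)
batch = collator([encoding])
print(batch["input_ids"].shape, batch["labels"].shape)
```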
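
Phase 2 can be sketched as a custom masking step that follows the five numbered steps above. The function below is a simplified reconstruction, not the original implementation: padding and batching (step 4) and special-token handling are omitted for brevity.

```python
# Phase 2 sketch: line-level masking reconstructed from the steps listed above.
import random
import re

import torch

NEWLINE_RE = re.compile(r"\n")
WHITESPACE_RE = re.compile(r"^\s*$")

def line_level_mask(input_ids, tokenizer, mask_ratio=0.30, seed=None):
    """Mask roughly `mask_ratio` of the non-whitespace tokens within each line segment."""
    rng = random.Random(seed)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)             # step 1: ids -> token strings
    texts = [tokenizer.convert_tokens_to_string([t]) for t in tokens]

    masked_ids = list(input_ids)
    labels = [-100] * len(input_ids)                                # -100 is ignored by the MLM loss

    segment = []                                                    # token positions in the current line
    for pos, text in enumerate(texts):
        if NEWLINE_RE.search(text):                                 # step 2: a newline token closes the line
            _mask_segment(segment, masked_ids, labels, input_ids, tokenizer, mask_ratio, rng)
            segment = []
            continue
        if WHITESPACE_RE.match(text):                               # step 3: skip whitespace-only tokens
            continue
        segment.append(pos)
    _mask_segment(segment, masked_ids, labels, input_ids, tokenizer, mask_ratio, rng)
    return torch.tensor(masked_ids), torch.tensor(labels)

def _mask_segment(segment, masked_ids, labels, input_ids, tokenizer, mask_ratio, rng):
    """Randomly mask a fraction of the token positions in one line segment (step 5)."""
    if not segment:
        return
    n_mask = max(1, round(len(segment) * mask_ratio))
    for pos in rng.sample(segment, n_mask):
        labels[pos] = input_ids[pos]
        masked_ids[pos] = tokenizer.mask_token_id

# Usage sketch:
# enc = tokenizer("def add(a, b):\n    return a + b", truncation=True, max_length=2048)
# ids, labels = line_level_mask(enc["input_ids"], tokenizer, mask_ratio=0.30)
```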
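
The listed hyperparameters map roughly onto the following 🤗 `TrainingArguments`; this is a sketch under the assumption of a standard `Trainer`-style loop, with the output directory as a placeholder.

```python
# Hyperparameter sketch mirroring the values listed above; not the released training script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./codemodernbert-crow-pretrain",  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,               # effective batch size: 16 x 16 = 256
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    fp16=True,                                    # mixed precision
    optim="adamw_torch",                          # AdamW
)
```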