---
license: apache-2.0
datasets:
- Shuu12121/python-treesitter-filtered-datasetsV2
- Shuu12121/javascript-treesitter-filtered-datasetsV2
- Shuu12121/ruby-treesitter-filtered-datasetsV2
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/rust-treesitter-filtered-datasetsV2
- Shuu12121/php-treesitter-filtered-datasetsV2
- Shuu12121/typescript-treesitter-filtered-datasetsV2
pipeline_tag: fill-mask
tags:
- code
- python
- java
- javascript
- typescript
- go
- ruby
- rust
- php
language:
- en
base_model:
- Shuu12121/CodeModernBERT-Crow-v1-Pre
---

# CodeModernBERT-Crow-v1.1🐦‍⬛

## Model Details

* **Model type**: Bi-encoder architecture based on ModernBERT
* **Architecture**:
  * Hidden size: 768
  * Layers: 12
  * Attention heads: 12
  * Intermediate size: 3,072
  * Max position embeddings: 8,192
  * Local attention window size: 128
  * Global RoPE positional encoding: θ = 160,000
  * Local RoPE positional encoding: θ = 10,000
* **Sequence length**: up to 2,048 tokens for code and docstring inputs during pretraining

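The configuration values above can be read back from the published checkpoint, and the `fill-mask` pipeline gives a quick sanity check. A minimal sketch, assuming the Hub repository ID is `Shuu12121/CodeModernBERT-Crow-v1.1` (inferred from the title above) and a recent `transformers` release with ModernBERT support:

```python
# Minimal sketch, not an official usage guide. The repository ID below is
# assumed from the model name; adjust it if the actual Hub ID differs.
from transformers import AutoConfig, AutoTokenizer, pipeline

model_id = "Shuu12121/CodeModernBERT-Crow-v1.1"  # assumed Hub ID

# Inspect the architecture values listed above.
config = AutoConfig.from_pretrained(model_id)
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.intermediate_size, config.max_position_embeddings)

# Fill-mask inference on a masked code snippet.
tokenizer = AutoTokenizer.from_pretrained(model_id)
fill = pipeline("fill-mask", model=model_id)
masked = f"def add(a, b):\n    return a {tokenizer.mask_token} b"
for pred in fill(masked, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```
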
## Pretraining

* **Tokenizer**: Custom BPE tokenizer trained on code and docstring pairs.
* **Data**: Functions and natural-language descriptions extracted from GitHub repositories.
* **Masking strategy**: Two-phase pretraining.
  * **Phase 1: Random Masked Language Modeling (MLM)**
    30% of the tokens in each code function are randomly masked and predicted with standard MLM.
  * **Phase 2: Line-level Span Masking**
    Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity (a minimal sketch follows this list):
    1. Convert input tokens back to strings.
    2. Detect newline tokens with a regex and segment the input by line.
    3. Exclude whitespace-only tokens from masking.
    4. Apply padding to align sequence lengths.
    5. Randomly mask 30% of the tokens in each line segment and predict them.

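The masking code itself is not part of this card; the sketch below only illustrates the line-level procedure in steps 1–5 (decode tokens, split on newline tokens, skip whitespace-only and special tokens, mask 30% per line). Step 4 (padding) is assumed to be handled by the batch collator, and Phase 1 corresponds to standard random masking as in `DataCollatorForLanguageModeling(mlm_probability=0.3)`. Function and variable names are illustrative, not the original training code.

```python
import random
import re

def line_level_span_mask(input_ids, tokenizer, mask_ratio=0.30):
    """Illustrative line-level span masking; not the original training code."""
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    labels = [-100] * len(input_ids)   # -100 is ignored by the MLM loss
    masked_ids = list(input_ids)

    # Steps 1-2: convert tokens back to strings and segment the sequence
    # wherever a token contains a newline.
    lines, current = [], []
    for idx, tok in enumerate(tokens):
        current.append(idx)
        if re.search(r"\n", tokenizer.convert_tokens_to_string([tok])):
            lines.append(current)
            current = []
    if current:
        lines.append(current)

    for line in lines:
        # Step 3: exclude whitespace-only and special tokens from masking.
        candidates = [
            i for i in line
            if tokenizer.convert_tokens_to_string([tokens[i]]).strip()
            and input_ids[i] not in tokenizer.all_special_ids
        ]
        if not candidates:
            continue
        # Step 5: randomly mask ~30% of the maskable tokens in this line.
        n_mask = max(1, int(len(candidates) * mask_ratio))
        for i in random.sample(candidates, n_mask):
            labels[i] = input_ids[i]           # predict the original token here
            masked_ids[i] = tokenizer.mask_token_id

    # Step 4 (padding to a common length) is left to the batch collator.
    return masked_ids, labels
```
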
* **Pretraining hyperparameters**:
  * Batch size: 16
  * Gradient accumulation steps: 16
  * Effective batch size: 256
  * Optimizer: AdamW
  * Learning rate: 5e-5
  * Scheduler: Cosine
  * Epochs: 3
  * Precision: Mixed precision (fp16) using `transformers`
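
For reference, a sketch of how these hyperparameters map onto `transformers` `TrainingArguments`; the original training script is not included in this card, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; not the original script.
training_args = TrainingArguments(
    output_dir="crow-v1.1-mlm",            # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,        # 16 x 16 = effective batch size 256
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    fp16=True,                             # mixed-precision training
    optim="adamw_torch",                   # AdamW
)
```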