|
--- |
|
license: apache-2.0

pipeline_tag: fill-mask
|
datasets: |
|
- bigcode/starcoderdata |
|
- bigcode/starcoder2data-extras |
|
language: |
|
- en |
|
tags: |
|
- code |
|
- python |
|
- java |
|
- javascript |
|
- typescript |
|
- go |
|
- rust |
|
- php |
|
- ruby |
|
- cpp |
|
- c |
|
- sql |
|
--- |
|
# CodeModernBERT-Crow-v1-Pre |
|
|
|
## Model Description |
|
|
|
**CodeModernBERT-Crow-v1-Pre** is a pretrained language model based on the ModernBERT architecture, adapted specifically to source code and docstring-style natural language.

It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.
|
|
|
* **License**: Apache-2.0 |
|
* **Supported Languages**: Python, JavaScript, TypeScript, Java, Go, Rust, PHP, Ruby, C++, C, SQL |
|
* **Datasets**: |
|
|
|
* [bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) |
|
* [bigcode/starcoder2data-extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras) |
|
* **Pipeline tag**: `fill-mask` |
|
|
|
This model is a **pretraining checkpoint**, designed for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization. |
|
|
|
--- |
|
|
|
## Training Objective |
|
|
|
The model was pretrained on large-scale multilingual code corpora with the following goals: |
|
|
|
* Learn robust code representations across multiple programming languages. |
|
* Capture semantic relations between code tokens and natural language descriptions. |
|
* Provide a strong initialization point for fine-tuning on code-related downstream tasks. |
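
Because the checkpoint ships with a `fill-mask` pipeline tag and a masked-language-modeling head, the objective can be exercised directly. The sketch below is illustrative only: the 15% masking probability is an assumption of this example, not a documented training setting.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Assumed masking probability; the value used during pretraining is not published here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(["def add(a, b): return a + b"], truncation=True, max_length=1024)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])

with torch.no_grad():
    outputs = model(**batch)
print(outputs.loss)  # cross-entropy over the randomly masked positions
```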
|
|
|
--- |
|
|
|
## Tokenizer |
|
|
|
A custom **BPE tokenizer** was trained for code and docstrings. |
|
|
|
* **Vocabulary size**: 50,368 |
|
* **Special tokens**: Standard Hugging Face special tokens + custom tokens for code/document structure. |
|
* **Training process**: |
|
|
|
* Up to 1M examples per dataset. |
|
* Each example truncated to 10,000 characters. |
|
* Trained with files from multiple datasets (see above). |
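
For illustration, the sketch below shows how a byte-level BPE tokenizer with this vocabulary size could be trained using the `tokenizers` library. The special-token list, pre-tokenizer choice, and in-line corpus are hypothetical stand-ins; the exact configuration used for this checkpoint is not published in this card.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Hypothetical special tokens; the card only states "standard Hugging Face
# special tokens + custom tokens" without listing them.
special_tokens = ["[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=50_368, special_tokens=special_tokens)

# Stand-in corpus; in practice, up to 1M examples per dataset are streamed and
# each example is truncated to 10,000 characters before training.
corpus = (text[:10_000] for text in [
    "def add(a, b):\n    return a + b",
    '"""Add two numbers and return the sum."""',
])

tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("crow-bpe-tokenizer.json")
```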
|
|
|
--- |
|
|
|
## Architecture |
|
|
|
* **Base**: ModernBERT |
|
* **Hidden size**: 768 |
|
* **Number of layers**: 12 |
|
* **Attention heads**: 12 |
|
* **Intermediate size**: 3072 |
|
* **Max sequence length**: 8192 (inputs were limited to 1,024 tokens during training)
|
* **RoPE positional encoding**: supported |
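
These values can be checked against the released configuration; the attribute names below are the standard Hugging Face config fields used by ModernBERT-style models:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 8192
```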
|
|
|
--- |
|
|
|
## Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre") |
|
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre") |
|
|
|
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt") |
|
outputs = model(**inputs) |
|
``` |
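
Because the pipeline tag is `fill-mask`, the checkpoint can also be queried through the `fill-mask` pipeline. A minimal sketch (the mask token is read from the tokenizer rather than hardcoded):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Shuu12121/CodeModernBERT-Crow-v1-Pre")
masked = f"def add(a, b): return a {fill.tokenizer.mask_token} b"
for prediction in fill(masked, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 4))
```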
|
|
|
The model can be fine-tuned for: |
|
|
|
* Code search (query ↔ code retrieval; see the embedding sketch after this list)
|
* Code clone detection |
|
* Code summarization (docstring prediction) |
|
* Bug detection and repair (framed as masked-language-modeling / cloze-style prediction)
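
For retrieval-style tasks such as code search, one common starting point is to derive fixed-size embeddings from the encoder before or during fine-tuning. The sketch below mean-pools the last hidden state over non-padding tokens; this pooling strategy is an illustrative assumption, not a method prescribed by this card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

def embed(texts):
    # Mean-pool the last hidden state over non-padding positions.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query_vec = embed(["add two numbers"])
code_vec = embed(["def add(a, b): return a + b"])
print(torch.cosine_similarity(query_vec, code_vec))
```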
|
|
|
--- |
|
|
|
## Limitations |
|
|
|
* The model is not optimized for direct code generation. |
|
* Pretraining alone does not guarantee that predicted or completed code is functionally correct or executable.
|
* Fine-tuning is recommended for specific downstream applications. |
|
|
|
--- |
|
|
|
## Intended Use |
|
|
|
* Research in software engineering and natural language processing for code. |
|
* Educational exploration of pretrained models for code tasks. |
|
* Baseline for continued pretraining or fine-tuning. |
|
|
|
--- |