|
--- |
|
license: mit |
|
datasets: |
|
- mahdin70/cwe_enriched_balanced_bigvul_primevul |
|
metrics: |
|
- accuracy |
|
- precision |
|
- recall |
|
- f1 |
|
base_model: |
|
- microsoft/codebert-base |
|
library_name: transformers |
|
--- |
|
|
|
|
|
# CodeBERT-VulnCWE - Fine-Tuned CodeBERT for Vulnerability and CWE Classification |
|
|
|
## Model Overview |
|
This model is a fine-tuned version of **microsoft/codebert-base** on a curated and enriched dataset for vulnerability detection and CWE classification. It is capable of predicting whether a given code snippet is vulnerable and, if vulnerable, identifying the specific CWE ID associated with it. |
|
|
|
## Dataset |
|
The model was fine-tuned using the dataset [mahdin70/cwe_enriched_balanced_bigvul_primevul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul). The dataset contains both vulnerable and non-vulnerable code samples and is enriched with CWE metadata. |
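A minimal sketch for loading the dataset with the `datasets` library is shown below; the exact split and column names depend on the dataset card and are not assumed here.

```python
from datasets import load_dataset

# Load the enriched, balanced BigVul + PrimeVul dataset from the Hugging Face Hub.
ds = load_dataset("mahdin70/cwe_enriched_balanced_bigvul_primevul")

# Inspect the available splits and features; the column names (code, vulnerability
# label, CWE label) should be checked against the dataset card before use.
print(ds)
```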
|
|
|
### CWE IDs Covered: |
|
1. **CWE-119**: Improper Restriction of Operations within the Bounds of a Memory Buffer |
|
2. **CWE-20**: Improper Input Validation |
|
3. **CWE-125**: Out-of-bounds Read |
|
4. **CWE-399**: Resource Management Errors |
|
5. **CWE-200**: Information Exposure |
|
6. **CWE-787**: Out-of-bounds Write |
|
7. **CWE-264**: Permissions, Privileges, and Access Controls |
|
8. **CWE-416**: Use After Free |
|
9. **CWE-476**: NULL Pointer Dereference |
|
10. **CWE-190**: Integer Overflow or Wraparound |
|
11. **CWE-189**: Numeric Errors |
|
12. **CWE-362**: Concurrent Execution using Shared Resource with Improper Synchronization |
|
|
|
--- |
|
|
|
## Model Training |
|
The model was trained for **3 epochs** with the following configuration (a hedged `TrainingArguments` sketch follows the list):
|
- **Learning Rate**: 2e-5 |
|
- **Weight Decay**: 0.01 |
|
- **Batch Size**: 8 |
|
- **Optimizer**: AdamW |
|
- **Scheduler**: Linear |
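As a rough illustration (not the original training script), these hyperparameters correspond to a `transformers.TrainingArguments` setup along the following lines:

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters; the actual script may have used
# additional settings (e.g., warmup, mixed precision) not documented here.
training_args = TrainingArguments(
    output_dir="codebert-vulncwe",      # hypothetical output directory
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    optim="adamw_torch",                # AdamW optimizer
    lr_scheduler_type="linear",         # linear learning-rate schedule
)
```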
|
|
|
### Training Loss and Validation Metrics Per Epoch: |
|
| Epoch | Training Loss | Validation Loss | Vul Accuracy | Vul Precision | Vul Recall | Vul F1 | CWE Accuracy |
|-------|---------------|-----------------|--------------|---------------|------------|--------|--------------|
| 1     | 1.4663        | 1.4988          | 0.7887       | 0.8526        | 0.5498     | 0.6685 | 0.2932       |
| 2     | 1.2107        | 1.3474          | 0.8038       | 0.8493        | 0.6002     | 0.7034 | 0.3688       |
| 3     | 1.1885        | 1.3096          | 0.8034       | 0.8020        | 0.6541     | 0.7205 | 0.3963       |
|
|
|
#### Training Summary: |
|
- **Total Training Steps**: 2958 |
|
- **Training Loss**: 1.3862 |
|
- **Training Time**: 3058.7 seconds (~51 minutes) |
|
- **Training Speed**: 15.47 samples per second |
|
- **Steps Per Second**: 0.967 |
|
|
|
|
|
## How to Use the Model |
|
```python
from transformers import AutoModel, AutoTokenizer

# Load the fine-tuned multi-task model and the original CodeBERT tokenizer.
model = AutoModel.from_pretrained("mahdin70/CodeBERT-VulnCWE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

# Example: an out-of-bounds write into a fixed-size array.
code_snippet = "int main() { int arr[10]; arr[11] = 5; return 0; }"
inputs = tokenizer(code_snippet, return_tensors="pt")
outputs = model(**inputs)

# The model returns two heads: binary vulnerability logits and CWE-class logits.
vul_logits = outputs["vul_logits"]
cwe_logits = outputs["cwe_logits"]

vul_pred = vul_logits.argmax(dim=1).item()
cwe_pred = cwe_logits.argmax(dim=1).item()

print(f"Vulnerability: {'Vulnerable' if vul_pred == 1 else 'Non-vulnerable'}")
print(f"CWE ID: {cwe_pred if vul_pred == 1 else 'N/A'}")
```
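Note that `cwe_pred` is a class index, not a CWE identifier. One possible way to map it back to a CWE ID is sketched below; the `CWE_LABELS` list and its ordering are an assumption (it simply mirrors the order in the Dataset section) and should be verified against the dataset's label encoding.

```python
# Hypothetical index-to-CWE mapping; verify the actual label order against
# the dataset's label encoding before relying on it.
CWE_LABELS = [
    "CWE-119", "CWE-20", "CWE-125", "CWE-399", "CWE-200", "CWE-787",
    "CWE-264", "CWE-416", "CWE-476", "CWE-190", "CWE-189", "CWE-362",
]

print(f"CWE ID: {CWE_LABELS[cwe_pred] if vul_pred == 1 else 'N/A'}")
```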
|
|
|
## Limitations and Future Improvements |
|
- The model achieves a CWE classification accuracy of 39.63% on the validation set, indicating significant room for improvement. Advanced architectures, better data balancing, or additional pretraining could enhance performance. |
|
- The model's vulnerability detection F1-score (72.05% on validation) is moderate but could be improved with further tuning or a larger dataset. |
|
- The model may struggle with edge cases or CWEs not well-represented in the training data. |
|
- Test set evaluation metrics are pending. Running the model on the test set will provide a clearer picture of its generalization. |
|
|
|
|
|
## Notes |
|
- Ensure the `trust_remote_code=True` flag is used when loading the model, as it relies on custom code for the `MultiTaskCodeBERT` architecture. |
|
- The model expects input code snippets tokenized using the CodeBERT tokenizer (`microsoft/codebert-base`). |
|
- For best results, preprocess code snippets consistently with the training dataset (e.g., a maximum length of 512 tokens; see the tokenization sketch below).
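For example, a tokenization call consistent with that limit might look like this (the 512-token cap matches CodeBERT's maximum input length; the padding strategy is an assumption):

```python
# Truncate and pad to CodeBERT's 512-token limit, mirroring training preprocessing.
inputs = tokenizer(
    code_snippet,
    truncation=True,
    max_length=512,
    padding="max_length",  # padding strategy is an assumption
    return_tensors="pt",
)
```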
|
|
|
|