Vulnerability Detector for C Code (SARD)

This model is a fine-tuned version of microsoft/codebert-base designed to detect vulnerabilities in C source code functions.

Model Description

This is a binary text-classification model that takes a C function as input and classifies it as either Vulnerable (LABEL_1) or Safe (LABEL_0).
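
The raw labels are not self-explanatory, so a small lookup can make the output easier to read. The sketch below assumes only the mapping stated above; the helper name `readable_label` and the sample prediction dict are hypothetical and used purely for illustration.

```python
# Mapping taken from the description above: LABEL_1 = Vulnerable, LABEL_0 = Safe.
LABEL_NAMES = {"LABEL_0": "Safe", "LABEL_1": "Vulnerable"}


def readable_label(prediction: dict) -> str:
    """Translate a prediction dict into a human-readable class name."""
    return LABEL_NAMES[prediction["label"]]


# Hypothetical prediction, shaped like the output of a text-classification
# pipeline. It is not a real model result.
example = {"label": "LABEL_1", "score": 0.97}
print(readable_label(example))  # prints "Vulnerable"
```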

The model was fine-tuned on the NIST Software Assurance Reference Dataset (SARD), focusing on common C weaknesses such as memory leaks, buffer overflows, and other CWEs present in the Juliet Test Suite. Because SARD test cases are clean and consistently structured, the model achieved very high accuracy on the validation set.

Intended Uses & Limitations

This model is intended as a proof-of-concept tool to assist developers in identifying potentially vulnerable code patterns during the development lifecycle.

Limitations:

The model is highly specialized for the types of vulnerabilities found in the SARD dataset. Its performance on real-world, messy, or obfuscated code may be noticeably lower than on the validation set.
It should be used as an assistive tool, not as a replacement for comprehensive security audits or other static analysis tools.
The model classifies entire functions and may not pinpoint the exact line of code responsible for the vulnerability.
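
One way to keep the model in an assistive role is to treat its output as a triage signal that routes functions to a human reviewer. The following is a minimal sketch under stated assumptions: the helper `needs_review` and its `threshold` default are hypothetical, and it assumes the reported score is the softmax probability of the predicted label in a two-class setup, so the probability of the other class is one minus that score.

```python
def needs_review(prediction: dict, threshold: float = 0.5) -> bool:
    """Return True when a function should be queued for manual review.

    `prediction` is a dict with "label" and "score" keys. The threshold is an
    illustrative choice, not a value recommended by the model authors.
    """
    score = prediction["score"]
    # Assumption: binary softmax output, so P(vulnerable) = 1 - P(safe).
    if prediction["label"] == "LABEL_1":
        p_vulnerable = score
    else:
        p_vulnerable = 1.0 - score
    return p_vulnerable >= threshold
```

Lowering the threshold sends more functions to review, which may suit code that differs from SARD and on which the model is less reliable.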

How to Use

The model can be loaded through the pipeline API of the transformers library:

from transformers import pipeline

# Load the classifier pipeline
classifier = pipeline("text-classification", model="jacpacd/vuln-detector-codebert-c-sard")

# Example of a vulnerable C function (Memory Leak)
vulnerable_code = """
void CWE401_Memory_Leak__strdup_char_01_bad()
{
    char * data;
    data = NULL;
    {
        char myString[] = "myString";
        /* POTENTIAL FLAW: Allocate memory from the heap */
        data = strdup(myString);
        printLine(data);
    }
    /* POTENTIAL FLAW: No deallocation of memory */
    ;
}
"""

# Example of a safe C function
safe_code = """
void CWE401_Memory_Leak__strdup_char_01_goodB2G()
{
    char * data;
    data = NULL;
    {
        char myString[] = "myString";
        data = strdup(myString);
        printLine(data);
    }
    /* FIX: Deallocate memory */
    free(data);
}
"""

results_vuln = classifier(vulnerable_code)
results_safe = classifier(safe_code)

print(f"Vulnerable Code Prediction: {results_vuln[0]}")
# Expected output: {'label': 'LABEL_1', 'score': 0.99...}

print(f"Safe Code Prediction: {results_safe[0]}")
# Expected output: {'label': 'LABEL_0', 'score': 0.99...}
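
To check several functions at once, the same classifier can be given a list of strings. The helper below is a sketch rather than part of the model's API: `classify_functions` is a hypothetical name, and passing `truncation=True` assumes the pipeline call forwards that option to the tokenizer so that overly long functions are clipped instead of raising an error.

```python
from typing import Callable, Dict


def classify_functions(classifier: Callable, functions: Dict[str, str]) -> Dict[str, dict]:
    """Classify several C functions in one call and key the results by name.

    `classifier` is expected to behave like the text-classification pipeline
    loaded above: given a list of strings, it returns one dict per input.
    """
    names = list(functions)
    sources = [functions[name] for name in names]
    # Assumption: truncation=True is accepted and clips inputs that exceed
    # the model's maximum sequence length.
    predictions = classifier(sources, truncation=True)
    return dict(zip(names, predictions))
```

With the variables defined above, `classify_functions(classifier, {"bad": vulnerable_code, "good": safe_code})` would return one prediction per function name.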