---
library_name: transformers
tags:
- code
- cybersecurity
- vulnerability
- cpp
license: apache-2.0
datasets:
- lemon42-ai/minified-diverseful-multilabels
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
---

# Model Card for ThreatDetect-C-Cpp

This is a derivative version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
We fine-tuned ModernBERT-base to detect vulnerabilities in C/C++ code.
The current version reaches an accuracy of 86% on the evaluation set.
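A minimal inference sketch follows, assuming the fine-tuned weights are published on the Hugging Face Hub under the ID `lemon42-ai/ThreatDetect-C-Cpp`, that they load directly with the `transformers` text-classification pipeline, and that the returned label names match the ones listed under Model Description. None of these points is confirmed by this card, and the helper names `classify` and `top_label` are hypothetical. ModernBERT support requires a recent `transformers` release.

```python
from typing import Dict, List

# Assumed Hub ID; adjust it if the model is published under another name.
MODEL_ID = "lemon42-ai/ThreatDetect-C-Cpp"


def classify(code: str, model_id: str = MODEL_ID) -> List[Dict]:
    """Return one {"label", "score"} dict per class for a C/C++ snippet."""
    # Imported lazily so the helper below can be used without transformers.
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id)
    # top_k=None asks for the scores of all classes;
    # max_length=600 mirrors the training sequence length reported in this card.
    return clf(code, top_k=None, truncation=True, max_length=600)


def top_label(scores: List[Dict]) -> str:
    """Pick the label with the highest score from the pipeline output."""
    return max(scores, key=lambda item: item["score"])["label"]
```

Calling `top_label(classify(snippet))` on a string of C/C++ source would then yield a single predicted label, which a caller could compare against `safe` (an assumed label name) to decide whether to flag the snippet.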
## Model Details

### Model Description

ThreatDetect-C-Cpp can be used as a code classifier.
Instead of binary classification ("safe" vs. "unsafe"), the model classifies the input code into 7 labels: `safe` (no vulnerability detected) and six CWE weaknesses:

| Label   | Description                                                              |
|---------|--------------------------------------------------------------------------|
| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer  |
| CWE-125 | Out-of-bounds Read                                                       |
| CWE-20  | Improper Input Validation                                                |
| CWE-416 | Use After Free                                                           |
| CWE-703 | Improper Check or Handling of Exceptional Conditions                     |
| CWE-787 | Out-of-bounds Write                                                      |
| safe    | Safe code                                                                |

- **Developed by:** [lemon42-ai](https://github.com/lemon42-ai)
- **Contributors:** [Abdellah Oumida](https://www.linkedin.com/in/abdellah-oumida-ab9082234/) & [Mohammed Sbaihi](https://www.linkedin.com/in/mohammed-sbaihi-aa6493254/)
- **Model type:** [ModernBERT, Encoder-only Transformer](https://arxiv.org/abs/2412.13663)
- **Supported Programming Languages:** C/C++
- **License:** Apache 2.0 (see the original license of ModernBERT-base)
- **Finetuned from model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)

### Model Sources

- **Repository:** [The official lemon42-ai GitHub repository](https://github.com/lemon42-ai/ThreatDetect-code-vulnerability-detection)
- **Technical Blog Post:** Coming soon.

## Uses

ThreatDetect-C-Cpp can be integrated into code-related applications. For example, it can be paired with a code generator to detect vulnerabilities in the generated code.

## Bias, Risks, and Limitations

ThreatDetect-C-Cpp can detect weaknesses in C/C++ code only. It should not be used with other programming languages.
The model can only detect the six CWEs in the table above.

## Training Details

### Training Data

The model was fine-tuned on a minified, cleaned, and deduplicated version of the [DiverseVul](https://github.com/wagner-group/diversevul) dataset.
This new version can be explored on the Hugging Face Hub [here](https://huggingface.co/datasets/lemon42-ai/minified-diverseful-multilabels).

### Training Procedure

The model was trained using LoRA applied to the Q and V matrices (target module `attn.Wqkv`).

#### Training Hyperparameters

| Hyperparameter       | Value                       |
|----------------------|-----------------------------|
| Max Sequence Length  | 600                         |
| Batch Size           | 32                          |
| Number of Epochs     | 9                           |
| Learning Rate        | 5e-4                        |
| Weight Decay         | 0.01                        |
| Logging Steps        | 100                         |
| LoRA Rank (r)        | 8                           |
| LoRA Alpha           | 32                          |
| LoRA Dropout         | 0.1                         |
| LoRA Target Modules  | attn.Wqkv                   |
| Optimizer            | AdamW                       |
| LR Scheduler         | CosineAnnealingWarmRestarts |
| Scheduler T_0        | 10                          |
| Scheduler T_mult     | 2                           |
| Scheduler eta_min    | 1e-6                        |
| Training Split Ratio | 90% Train / 10% Validation  |
| Seed for Splitting   | 42                          |

## Evaluation

ThreatDetect-C-Cpp reaches an accuracy of 86% on the evaluation set.

## Technical Specifications

#### Hardware

The model was fine-tuned on 4 Tesla V100 GPUs for 1 hour using PyTorch and Accelerate.
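
To make the scheduler settings in the Training Hyperparameters table more concrete, the sketch below computes the learning rate that cosine annealing with warm restarts would produce for the listed values (base learning rate 5e-4, `T_0` = 10, `T_mult` = 2, `eta_min` = 1e-6). It is a plain-Python illustration of the schedule, not the training code: the function name `warm_restart_lr` is hypothetical, and this card does not state whether the scheduler was stepped once per epoch or once per batch.

```python
import math


def warm_restart_lr(step: int,
                    base_lr: float = 5e-4,
                    eta_min: float = 1e-6,
                    t_0: int = 10,
                    t_mult: int = 2) -> float:
    """Learning rate after `step` scheduler steps.

    Each cycle decays from base_lr toward eta_min along a half cosine;
    the first cycle lasts t_0 steps and every later cycle is t_mult times longer.
    """
    t_i, t_cur = t_0, step
    # Skip over completed cycles to find the position inside the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = 1 + math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine


if __name__ == "__main__":
    for step in (0, 5, 9, 10, 30):
        print(step, f"{warm_restart_lr(step):.2e}")
```

Under these settings the cycles last 10, 20, 40, ... scheduler steps, so if the scheduler were stepped once per epoch, the 9 training epochs would fall entirely within the first cycle and no restart would occur.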