library_name: transformers
tags:
- code
- cybersecurity
- vulnerability
- cpp
license: apache-2.0
datasets:
- lemon42-ai/minified-diverseful-multilabels
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
Model Card for ThreatDetect-C-Cpp

This model is a fine-tuned derivative of answerdotai/ModernBERT-base.
We fine-tuned ModernBERT-base to detect vulnerabilities in C/C++ code.
The current version reaches 86% accuracy on the evaluation set.
Model Details
Model Description
ThreatDetect-C-Cpp can be used as a code classifier.
Instead of binary classification ("safe", "unsafe"), the model classifies the input code into 7 labels: 'safe' (no vulnerability detected) and six CWE weaknesses:
Label | Description |
---|---|
CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer |
CWE-125 | Out-of-bounds Read |
CWE-20 | Improper Input Validation |
CWE-416 | Use After Free |
CWE-703 | Improper Check or Handling of Exceptional Conditions |
CWE-787 | Out-of-bounds Write |
safe | Safe code |
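As a minimal sketch of how the seven labels above could be recovered from raw classifier outputs, the snippet below applies a softmax to a logit vector and looks up the winning label. The index order in `ID2LABEL` is an assumption made for illustration; the authoritative mapping is whatever the model's config stores under `id2label`. The helper names are hypothetical.

```python
import math

# Hypothetical index order, for illustration only. Read the real mapping
# from the model's config (id2label) rather than hardcoding it.
ID2LABEL = {
    0: "CWE-119",
    1: "CWE-125",
    2: "CWE-20",
    3: "CWE-416",
    4: "CWE-703",
    5: "CWE-787",
    6: "safe",
}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(id2label):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]
```

A caller would pass the logits produced for one code snippet and treat any label other than 'safe' as a detected weakness.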
- Developed by: lemon42-ai
- Contributors: Abdellah Oumida & Mohammed Sbaihi
- Model type: ModernBERT, Encoder-only Transformer
- Supported Programming Languages: C/C++
- License: Apache 2.0 (see the original ModernBERT-base license)
- Finetuned from model: answerdotai/ModernBERT-base.
Model Sources
- Repository: The official lemon42-ai GitHub repository
- Technical Blog Post: Coming soon.
Uses
ThreatDetect-C-Cpp can be integrated into code-related applications. For example, it can be paired with a code generator to detect vulnerabilities in the generated code.
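The pairing with a code generator could look roughly like the sketch below: each generated snippet is classified, and anything not labeled 'safe' is set aside together with the predicted CWE. This is an illustration under assumptions, not the authors' integration. The helper names `load_classifier` and `screen_snippets` are hypothetical, the Hub repository id is not stated in this card and must be supplied by the caller, and the sketch assumes the pipeline returns the label strings from the table above.

```python
from typing import Callable, Iterable, List, Tuple


def load_classifier(model_id: str) -> Callable[[str], str]:
    """Build a classify(code) -> label function from a Hub model id.

    model_id is the repository of this model on the Hub; it is not given
    in this card, so the caller has to provide it.
    """
    from transformers import pipeline  # lazy import: only needed here

    clf = pipeline("text-classification", model=model_id)

    def classify(code: str) -> str:
        # Truncate to the 600-token maximum sequence length used in training.
        result = clf(code, truncation=True, max_length=600)
        return result[0]["label"]

    return classify


def screen_snippets(
    snippets: Iterable[str], classify: Callable[[str], str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split generated snippets into accepted ones and (snippet, label) flags."""
    accepted, flagged = [], []
    for code in snippets:
        label = classify(code)
        if label == "safe":
            accepted.append(code)
        else:
            flagged.append((code, label))
    return accepted, flagged
```

Because `screen_snippets` only needs a callable, it can be exercised with a stub classifier before wiring in the real model, and flagged snippets can be sent back to the generator or to a human reviewer.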
Bias, Risks, and Limitations
ThreatDetect-C-Cpp can detect weaknesses in C/C++ code only. It should not be used with other programming languages.
The model can only detect the six CWEs listed in the table above.
Training Details
Training Data
The model was fine-tuned on a minified, cleaned, and deduplicated version of the DiverseVul dataset.
This version is available on Hugging Face Datasets as lemon42-ai/minified-diverseful-multilabels.
Training Procedure
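The model was trained with LoRA adapters applied to the attention projections. In ModernBERT the query, key, and value projections are fused into a single attn.Wqkv module, which is the LoRA target listed below.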
Training Hyperparameters
Hyperparameter | Value |
---|---|
Max Sequence Length | 600 |
Batch Size | 32 |
Number of Epochs | 9 |
Learning Rate | 5e-4 |
Weight Decay | 0.01 |
Logging Steps | 100 |
LoRA Rank (r) | 8 |
LoRA Alpha | 32 |
LoRA Dropout | 0.1 |
LoRA Target Modules | attn.Wqkv |
Optimizer | AdamW |
LR Scheduler | CosineAnnealingWarmRestarts |
Scheduler T_0 | 10 |
Scheduler T_mult | 2 |
Scheduler eta_min | 1e-6 |
Training Split Ratio | 90% Train / 10% Validation |
Seed for Splitting | 42 |
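To make the table concrete, here is a hedged sketch of how these values might map onto `peft` and `torch` objects, plus a pure-Python version of the warm-restart schedule so the learning rate can be inspected without a GPU. This is not the authors' training script: `build_training_objects` and `warm_restart_lr` are hypothetical names, and the card does not say whether the scheduler was stepped per batch or per epoch.

```python
import math

# Values copied from the hyperparameter table above.
LEARNING_RATE = 5e-4
WEIGHT_DECAY = 0.01
LORA_RANK, LORA_ALPHA, LORA_DROPOUT = 8, 32, 0.1
T_0, T_MULT, ETA_MIN = 10, 2, 1e-6


def warm_restart_lr(t, base_lr=LEARNING_RATE, eta_min=ETA_MIN,
                    t_0=T_0, t_mult=T_MULT):
    """Learning rate after t scheduler steps of cosine annealing with restarts."""
    cycle_len, t_cur = t_0, t
    # Walk through completed cycles; each cycle is t_mult times longer.
    while t_cur >= cycle_len:
        t_cur -= cycle_len
        cycle_len *= t_mult
    cosine = math.cos(math.pi * t_cur / cycle_len)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


def build_training_objects(model):
    """Wrap a loaded sequence-classification model with LoRA and create
    the optimizer and scheduler. Assumes torch and peft are installed."""
    import torch
    from peft import LoraConfig, TaskType, get_peft_model

    config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        target_modules=["attn.Wqkv"],
    )
    peft_model = get_peft_model(model, config)
    trainable = [p for p in peft_model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(
        trainable, lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=T_0, T_mult=T_MULT, eta_min=ETA_MIN
    )
    return peft_model, optimizer, scheduler
```

With these settings the LoRA update is scaled by alpha / r = 4, and the schedule restarts at the full learning rate after 10 scheduler steps and again after 30. If the scheduler were stepped once per epoch, the 9 training epochs would end before the first restart.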
Evaluation
ThreatDetect-C-Cpp reaches an accuracy of 86% on the evaluation set.
Technical Specifications
Hardware
The model was fine-tuned on 4 Tesla V100 GPUs for 1 hour using the torch and accelerate frameworks.