metadata

library_name: transformers
tags:
  - code
  - cybersecurity
  - vulnerability
  - cpp
license: apache-2.0
datasets:
  - lemon42-ai/minified-diverseful-multilabels
metrics:
  - accuracy
base_model:
  - answerdotai/ModernBERT-base
pipeline_tag: text-classification

Model Card for ThreatDetect-C-Cpp

This model is derived from answerdotai/ModernBERT-base.
We fine-tuned ModernBERT-base to detect vulnerabilities in C/C++ code.
The current version reaches 86% accuracy on the evaluation set.

Model Details

Model Description

ThreatDetect-C-Cpp can be used as a code classifier.
Rather than performing binary classification ("safe" vs. "unsafe"), the model assigns the input code one of 7 labels: 'safe' (no vulnerability detected) or one of six CWE weaknesses:

Label	Description
CWE-119	Improper Restriction of Operations within the Bounds of a Memory Buffer
CWE-125	Out-of-bounds Read
CWE-20	Improper Input Validation
CWE-416	Use After Free
CWE-703	Improper Check or Handling of Exceptional Conditions
CWE-787	Out-of-bounds Write
safe	Safe code
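
A minimal sketch of querying the classifier through the transformers text-classification pipeline follows. The repository id lemon42-ai/ThreatDetect-C-Cpp is an assumption (this card does not state it), as is the expectation that the published checkpoint loads directly and exposes the label names from the table; if only LoRA adapter weights are published, loading would need peft instead. The 600-token truncation mirrors the maximum sequence length used in training.

```python
from typing import Dict, List

# Assumed Hub id; the card does not state the exact repository name.
ASSUMED_MODEL_ID = "lemon42-ai/ThreatDetect-C-Cpp"


def top_label(scores: List[Dict]) -> Dict:
    """Return the highest-scoring entry from a list of {'label', 'score'} dicts."""
    return max(scores, key=lambda entry: entry["score"])


def classify(code: str, model_id: str = ASSUMED_MODEL_ID) -> Dict:
    """Score one C/C++ snippet against all labels and return the best one."""
    # Imported lazily so top_label can be used without transformers installed.
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id)
    # top_k=None asks for a score per label; inputs are truncated to
    # 600 tokens to match the training setup described in this card.
    scores = clf(code, top_k=None, truncation=True, max_length=600)
    return top_label(scores)


# Example (downloads the model on first use):
# print(classify("void f(char *s) { char buf[8]; strcpy(buf, s); }"))
```

The returned dictionary carries the predicted label and its probability, so a caller can apply a confidence threshold before acting on a non-'safe' prediction.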

Developed by: lemon42-ai
Contributors: Abdellah Oumida & Mohammed Sbaihi
Model type: ModernBERT, Encoder-only Transformer
Supported Programming Languages: C/C++
License: Apache 2.0 (see the original license of ModernBERT-base)
Finetuned from model: answerdotai/ModernBERT-base.

Model Sources

Repository: The official lemon42-ai GitHub repository
Technical Blog Post: Coming soon.

Uses

ThreatDetect-C-Cpp can be integrated into code-related applications. For example, it can be paired with a code generator to detect vulnerabilities in the generated code.
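
One way to wire the classifier behind a code generator is sketched here. Everything in it is illustrative: flag_unsafe and stub_classifier are hypothetical names, and the stub stands in for the real model so the example runs without downloading anything. Only the 'safe' label name comes from this card.

```python
from typing import Callable, Iterable, List, Tuple

SAFE_LABEL = "safe"  # label name taken from the table in this card


def flag_unsafe(
    snippets: Iterable[str],
    classify: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (snippet, label) pairs for every snippet not labeled 'safe'."""
    flagged = []
    for snippet in snippets:
        label = classify(snippet)
        if label != SAFE_LABEL:
            flagged.append((snippet, label))
    return flagged


def stub_classifier(code: str) -> str:
    """Hypothetical stand-in for the real model, for illustration only."""
    return "CWE-787" if "strcpy" in code else SAFE_LABEL


# Pretend these came from a code generator.
generated = [
    "int add(int a, int b) { return a + b; }",
    "void copy(char *s) { char buf[8]; strcpy(buf, s); }",
]
for snippet, label in flag_unsafe(generated, stub_classifier):
    print(f"{label}: {snippet}")
```

Taking the classifier as a plain callable keeps the gate independent of how the model is loaded, so the stub can be swapped for a real inference function without changing the calling code.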

Bias, Risks, and Limitations

ThreatDetect-C-Cpp can detect weaknesses in C/C++ code only. It should not be used with other programming languages.
The model can only detect the six CWEs listed in the table above.

Training Details

Training Data

The model was fine-tuned on a minified, cleaned, and deduplicated version of the DiverseVul dataset.
This version is available on Hugging Face Datasets as lemon42-ai/minified-diverseful-multilabels.

Training Procedure

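The model was trained using LoRA applied to the attention projection matrices (ModernBERT's fused attn.Wqkv module, which holds the query, key, and value projections).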
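
For readers who want to set up a comparable run, the LoRA settings reported in this card's hyperparameter table could be expressed with the peft library roughly as follows. This is a sketch, not the authors' training script: the use of peft and the SEQ_CLS task type are assumptions, while the rank, alpha, dropout, and target module come from the card.

```python
# LoRA settings as reported in this card's hyperparameter table.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "target_modules": ["attn.Wqkv"],
    "task_type": "SEQ_CLS",  # assumption: sequence-classification fine-tuning
}

# With standard LoRA, the low-rank update is scaled by lora_alpha / r.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Build a peft LoraConfig from the values above; assumes peft is installed."""
    from peft import LoraConfig  # imported lazily, optional dependency

    return LoraConfig(**LORA_KWARGS)
```

Passing the resulting config to peft.get_peft_model together with the base model would wrap the matching attention layers with trainable adapters while leaving the original weights frozen.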

Training Hyperparameters

Hyperparameter	Value
Max Sequence Length	600
Batch Size	32
Number of Epochs	9
Learning Rate	5e-4
Weight Decay	0.01
Logging Steps	100
LoRA Rank (r)	8
LoRA Alpha	32
LoRA Dropout	0.1
LoRA Target Modules	attn.Wqkv
Optimizer	AdamW
LR Scheduler	CosineAnnealingWarmRestarts
Scheduler T_0	10
Scheduler T_mult	2
Scheduler eta_min	1e-6
Training Split Ratio	90% Train / 10% Validation
Seed for Splitting	42
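
To make the scheduler settings concrete, the sketch below reimplements in plain Python the closed-form schedule that PyTorch documents for CosineAnnealingWarmRestarts, using the values from the table. The function name warm_restart_lr is illustrative, and the card does not say whether the scheduler was stepped per epoch or per batch, so the step unit is left abstract.

```python
import math


def warm_restart_lr(
    step: int,
    base_lr: float = 5e-4,   # Learning Rate from the table
    eta_min: float = 1e-6,   # Scheduler eta_min
    t_0: int = 10,           # Scheduler T_0: length of the first cycle
    t_mult: int = 2,         # Scheduler T_mult: cycle length multiplier
) -> float:
    """Learning rate at a given scheduler step under cosine warm restarts."""
    t_i, t_cur = t_0, step
    # Walk through completed cycles; each new cycle is t_mult times longer.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = 1 + math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine


for step in (0, 5, 9, 10, 29, 30):
    print(step, f"{warm_restart_lr(step):.3e}")
```

Under these settings the cycles last 10, 20, and 40 steps, with the rate jumping back to 5e-4 at steps 10 and 30. If the scheduler was stepped once per epoch, a 9-epoch run would end before the first restart and the schedule would behave as a single cosine decay.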

Evaluation

ThreatDetect-C-Cpp reaches an accuracy of 86% on the eval set.

Technical Specifications

Hardware

The model was fine-tuned on 4 Tesla V100 GPUs for 1 hour using the torch and accelerate frameworks.