Model Details

The development of the GenTel-Shield detection model follows a five-step process. First, a training dataset is constructed by gathering data from online sources and expert contributions. This data then undergoes binary labeling and cleaning to ensure quality. Next, data augmentation techniques are applied to expand the dataset. Following this, a pre-trained model is employed for the training phase. Finally, the trained model can distinguish between malicious and benign samples.

Below is a workflow of GenTel-Shield.

Training Data Preparation

Data Collection

Our training data is drawn from two primary sources. The first source encompasses risk data from public platforms, including websites such as jailbreakchat.com and reddit.com, in addition to established datasets from LLM applications, such as the VMware Open-Instruct dataset and the Chatbot Instruction Prompts dataset. And domain experts have annotated these examples, categorizing the prompts into two distinct groups: harmful injection attack samples and benign samples.

Data Augmentation

In real-world scenarios, we have encountered adversarial samples, such as those with added meaningless characters or deleted words, that can bypass detection by defense models, potentially leading to dangerous behaviors. To enhance the robustness of our detection model, we implemented data augmentation focusing on both semantic alterations and character-level perturbations of the samples. We employed four simple yet effective operations for character perturbation: synonym replacement, random insertion, random swap, and random deletion. We used LLMs to rewrite our data for semantic augmentation, thereby generating a more diverse set of training samples.

Model Training Details

We finetune the GenTel-Shield model on our proposed training text-pair dataset, initializing it from the multilingual E5 text embedding model. Training is conducted on a single machine equipped with one NVIDIA GeForce RTX 4090D (24GB) GPU, using a batch size of 32. The model is trained with a learning rate 2e-5, employing a cosine learning rate scheduler and a weight decay of 0.01 to mitigate overfitting. To optimize memory usage, we utilize mixed precision (fp16) training. Additionally, the training process includes a 500-step warmup phase, and we apply gradient clipping with a maximum norm of 1.0.

Evaluation

Dataset

Gentel-Bench provides a comprehensive framework for evaluating the robustness of models against a wide range of injection attacks. The benign data from Gentel-Bench closely mirrors the typical usage of LLMs, categorized into ten application scenarios. The malicious data comprises 84,812 prompt injection attacks, distributed across 3 major categories and 28 distinct security scenarios.

Gentel-Bench

We evaluate the model’s effectiveness in detecting Jailbreak, Goal Hijacking, and Prompt Leaking attacks on Gentel-Bench. The results demonstrate that our approach outperforms existing methods in most scenarios, particularly in terms of accuracy and F1 score.

Classification performance on Jailbreak Attack Scenarios

Method	Accuracy ↑	Precision ↑	F1 ↑	Recall ↑
ProtectAI	89.46	99.59	88.62	79.83
Hyperion	94.70	94.21	94.88	95.57
Prompt Guard	50.58	51.03	66.85	96.88
Lakera AI	87.20	92.12	86.84	82.14
Deepset	65.69	60.63	75.49	100
Fmops	63.35	59.04	74.25	100
WhyLabs LangKit	78.86	98.48	75.28	60.92
GenTel-Shield(Ours)	97.63	98.04	97.69	97.34

Classification performance on Goal Hijacking Attack Scenarios.

Method	Accuracy ↑	Precision ↑	F1 ↑	Recall ↑
ProtectAI	94.25	99.79	93.95	88.76
Hyperion	90.68	94.53	90.33	86.48
Prompt Guard	50.90	50.61	67.21	100
Lakera AI	74.63	88.59	69.33	56.95
Deepset	63.40	57.90	73.34	100
Fmops	61.03	56.36	72.09	100
WhyLabs LangKit	68.14	97.53	54.35	37.67
GenTel-Shield(Ours)	96.81	99.44	96.74	94.19

Classification Performance on Prompt Leaking Attack Scenarios.

Method	Accuracy ↑	Precision ↑	F1 ↑	Recall ↑
ProtectAI	90.94	99.77	90.06	82.08
Hyperion	90.85	95.01	90.41	86.23
Prompt Guard	50.28	50.14	66.79	100
Lakera AI	96.04	93.11	96.17	99.43
Deepset	61.79	57.08	71.34	95.09
Fmops	58.77	55.07	69.80	95.28
WhyLabs LangKit	99.34	99.62	99.34	99.06
GenTel-Shield(Ours)	97.92	99.42	97.89	96.42

Subdivision Scenarios

Citation

Li, Rongchang, et al. "GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks" arXiv preprint arXiv:2409.19521 (2024).