Model Details
The development of the GenTel-Shield detection model follows a five-step process. First, a training dataset is constructed by gathering data from online sources and expert contributions. This data then undergoes binary labeling and cleaning to ensure quality. Next, data augmentation techniques are applied to expand the dataset. Following this, a pre-trained model is employed for the training phase. Finally, the trained model can distinguish between malicious and benign samples.
Below is a workflow of GenTel-Shield.
Training Data Preparation
Data Collection
Our training data is drawn from two primary sources. The first source encompasses risk data from public platforms, including websites such as jailbreakchat.com and reddit.com, in addition to established datasets from LLM applications, such as the VMware Open-Instruct dataset and the Chatbot Instruction Prompts dataset. And domain experts have annotated these examples, categorizing the prompts into two distinct groups: harmful injection attack samples and benign samples.
Data Augmentation
In real-world scenarios, we have encountered adversarial samples, such as those with added meaningless characters or deleted words, that can bypass detection by defense models, potentially leading to dangerous behaviors. To enhance the robustness of our detection model, we implemented data augmentation focusing on both semantic alterations and character-level perturbations of the samples. We employed four simple yet effective operations for character perturbation: synonym replacement, random insertion, random swap, and random deletion. We used LLMs to rewrite our data for semantic augmentation, thereby generating a more diverse set of training samples.
Model Training Details
We finetune the GenTel-Shield model on our proposed training text-pair dataset, initializing it from the multilingual E5 text embedding model. Training is conducted on a single machine equipped with one NVIDIA GeForce RTX 4090D (24GB) GPU, using a batch size of 32. The model is trained with a learning rate 2e-5, employing a cosine learning rate scheduler and a weight decay of 0.01 to mitigate overfitting. To optimize memory usage, we utilize mixed precision (fp16) training. Additionally, the training process includes a 500-step warmup phase, and we apply gradient clipping with a maximum norm of 1.0.
Evaluation
Dataset
Gentel-Bench provides a comprehensive framework for evaluating the robustness of models against a wide range of injection attacks. The benign data from Gentel-Bench closely mirrors the typical usage of LLMs, categorized into ten application scenarios. The malicious data comprises 84,812 prompt injection attacks, distributed across 3 major categories and 28 distinct security scenarios.
Gentel-Bench
We evaluate the modelβs effectiveness in detecting Jailbreak, Goal Hijacking, and Prompt Leaking attacks on Gentel-Bench. The results demonstrate that our approach outperforms existing methods in most scenarios, particularly in terms of accuracy and F1 score.
Classification performance on Jailbreak Attack Scenarios
Method | Accuracy β | Precision β | F1 β | Recall β |
---|---|---|---|---|
ProtectAI | 89.46 | 99.59 | 88.62 | 79.83 |
Hyperion | 94.70 | 94.21 | 94.88 | 95.57 |
Prompt Guard | 50.58 | 51.03 | 66.85 | 96.88 |
Lakera AI | 87.20 | 92.12 | 86.84 | 82.14 |
Deepset | 65.69 | 60.63 | 75.49 | 100 |
Fmops | 63.35 | 59.04 | 74.25 | 100 |
WhyLabs LangKit | 78.86 | 98.48 | 75.28 | 60.92 |
GenTel-Shield(Ours) | 97.63 | 98.04 | 97.69 | 97.34 |
Classification performance on Goal Hijacking Attack Scenarios.
Method | Accuracy β | Precision β | F1 β | Recall β |
---|---|---|---|---|
ProtectAI | 94.25 | 99.79 | 93.95 | 88.76 |
Hyperion | 90.68 | 94.53 | 90.33 | 86.48 |
Prompt Guard | 50.90 | 50.61 | 67.21 | 100 |
Lakera AI | 74.63 | 88.59 | 69.33 | 56.95 |
Deepset | 63.40 | 57.90 | 73.34 | 100 |
Fmops | 61.03 | 56.36 | 72.09 | 100 |
WhyLabs LangKit | 68.14 | 97.53 | 54.35 | 37.67 |
GenTel-Shield(Ours) | 96.81 | 99.44 | 96.74 | 94.19 |
Classification Performance on Prompt Leaking Attack Scenarios.
Method | Accuracy β | Precision β | F1 β | Recall β |
---|---|---|---|---|
ProtectAI | 90.94 | 99.77 | 90.06 | 82.08 |
Hyperion | 90.85 | 95.01 | 90.41 | 86.23 |
Prompt Guard | 50.28 | 50.14 | 66.79 | 100 |
Lakera AI | 96.04 | 93.11 | 96.17 | 99.43 |
Deepset | 61.79 | 57.08 | 71.34 | 95.09 |
Fmops | 58.77 | 55.07 | 69.80 | 95.28 |
WhyLabs LangKit | 99.34 | 99.62 | 99.34 | 99.06 |
GenTel-Shield(Ours) | 97.92 | 99.42 | 97.89 | 96.42 |
Subdivision Scenarios
Citation
Li, Rongchang, et al. "GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks" arXiv preprint arXiv:2409.19521 (2024).
- Downloads last month
- 87