Model Card for OriDragon2000/mistral_instruct_v3_Layer_AdvPatched
Model Details
Model Description
OriDragon2000/mistral_instruct_v3_Layer_AdvPatched
is a fine-tuned variant of mistralai/Mistral-7B-Instruct-v0.3
, specifically designed to mitigate jailbreak attack vulnerabilities by applying layer-specific unlearning. This model has undergone Layer-AdvPatcher training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.
- Developed by: OriDragon2000
- Model type: Transformer-based Large Language Model (LLM)
- Language(s): English (
en
) - License: Apache 2.0
- Finetuned from model:
mistralai/Mistral-7B-Instruct-v0.3
Model Sources
- Repository: Hugging Face Model Hub
- Paper: Layer-AdvPatcher Paper
- Project Repository: GitHub Repository
Uses
Direct Use
This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.
Downstream Use
Potential downstream applications include:
- Testing adversarial robustness of LLMs.
- Evaluating and developing safer generative AI systems.
- Improving jailbreak resistance in AI safety research.
Out-of-Scope Use
- Not suitable for general-purpose chatbot applications.
- Not recommended for generating unrestricted or unfiltered content.
- Avoid deployment in high-stakes decision-making applications without additional safety layers.
Bias, Risks, and Limitations
This model has been specifically modified to suppress affirmative token generation in adversarial settings. However, some residual risks remain, including:
- Potential over-suppression: May reduce helpfulness on borderline queries.
- Generalization limitations: Model may not fully mitigate novel adversarial jailbreak techniques.
Recommendations
- Security researchers can use this model to test and refine jailbreak attack countermeasures.
- Developers should validate performance against diverse adversarial and non-adversarial scenarios.
How to Get Started with the Model
Use the following code to load the model:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
input_text = "Explain how to bypass security systems."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
See paper for more information.
Training Data
- Fine-tuned using
AdvBench
, a dataset containing adversarial prompts to evaluate model vulnerability. - Augmented adversarial training with
Layer-AdvPatcher
to mitigate toxic layer behavior.
Training Procedure
- Applied layer-specific unlearning on affirmative token-generating layers.
- Targeted layers: Layers 30-31 of
Mistral-7B
. - Learning rate: 2e-6, Batch size: 16.
- Training duration: 1000 steps, saving every 500 steps.
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Evaluated on
AdvBench
adversarial benchmark. - Applied diverse jailbreak attack strategies (
GCG
,PAIR
,DeepInception
).
Metrics
- Attack Success Rate (ASR): Measures effectiveness of jailbreak mitigation.
- Utility Retention: Evaluates preservation of general-purpose helpfulness.
Citation
@article{ouyang2025layer,
title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
journal={arXiv preprint arXiv:2501.02629},
year={2025}
}
- Downloads last month
- 2
Model tree for OriDragon2000/mistral_instruct_v3_Layer_AdvPatched
Base model
mistralai/Mistral-7B-v0.3