Model Card for OriDragon2000/mistral_instruct_v3_Layer_AdvPatched

Model Details

Model Description

OriDragon2000/mistral_instruct_v3_Layer_AdvPatched is a fine-tuned variant of mistralai/Mistral-7B-Instruct-v0.3, specifically designed to mitigate jailbreak attack vulnerabilities by applying layer-specific unlearning. This model has undergone Layer-AdvPatcher training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.

  • Developed by: OriDragon2000
  • Model type: Transformer-based Large Language Model (LLM)
  • Language(s): English (en)
  • License: Apache 2.0
  • Finetuned from model: mistralai/Mistral-7B-Instruct-v0.3

Model Sources

Uses

Direct Use

This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.

Downstream Use

Potential downstream applications include:

  • Testing adversarial robustness of LLMs.
  • Evaluating and developing safer generative AI systems.
  • Improving jailbreak resistance in AI safety research.

Out-of-Scope Use

  • Not suitable for general-purpose chatbot applications.
  • Not recommended for generating unrestricted or unfiltered content.
  • Avoid deployment in high-stakes decision-making applications without additional safety layers.

Bias, Risks, and Limitations

This model has been specifically modified to suppress affirmative token generation in adversarial settings. However, some residual risks remain, including:

  • Potential over-suppression: May reduce helpfulness on borderline queries.
  • Generalization limitations: Model may not fully mitigate novel adversarial jailbreak techniques.

Recommendations

  • Security researchers can use this model to test and refine jailbreak attack countermeasures.
  • Developers should validate performance against diverse adversarial and non-adversarial scenarios.

How to Get Started with the Model

Use the following code to load the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")

input_text = "Explain how to bypass security systems."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

See paper for more information.

Training Data

  • Fine-tuned using AdvBench, a dataset containing adversarial prompts to evaluate model vulnerability.
  • Augmented adversarial training with Layer-AdvPatcher to mitigate toxic layer behavior.

Training Procedure

  • Applied layer-specific unlearning on affirmative token-generating layers.
  • Targeted layers: Layers 30-31 of Mistral-7B.
  • Learning rate: 2e-6, Batch size: 16.
  • Training duration: 1000 steps, saving every 500 steps.

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Evaluated on AdvBench adversarial benchmark.
  • Applied diverse jailbreak attack strategies (GCG, PAIR, DeepInception).

Metrics

  • Attack Success Rate (ASR): Measures effectiveness of jailbreak mitigation.
  • Utility Retention: Evaluates preservation of general-purpose helpfulness.

Citation

@article{ouyang2025layer,
  title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
  author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
  journal={arXiv preprint arXiv:2501.02629},
  year={2025}
}
Downloads last month
2
Safetensors
Model size
7.25B params
Tensor type
BF16
·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no library tag.

Model tree for OriDragon2000/mistral_instruct_v3_Layer_AdvPatched

Finetuned
(109)
this model

Dataset used to train OriDragon2000/mistral_instruct_v3_Layer_AdvPatched