Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`

Model Details

Model Description

OriDragon2000/mistral_instruct_v3_Layer_AdvPatched is a fine-tuned variant of mistralai/Mistral-7B-Instruct-v0.3, specifically designed to mitigate jailbreak attack vulnerabilities by applying layer-specific unlearning. This model has undergone Layer-AdvPatcher training to suppress affirmative token generation in adversarial scenarios, reducing susceptibility to harmful prompts while maintaining general usability.

Developed by: OriDragon2000
Model type: Transformer-based Large Language Model (LLM)
Language(s): English (en)
License: Apache 2.0
Finetuned from model: mistralai/Mistral-7B-Instruct-v0.3

Model Sources

Repository: Hugging Face Model Hub
Paper: Layer-AdvPatcher Paper
Project Repository: GitHub Repository

Uses

Direct Use

This model is intended for research on adversarial robustness, jailbreak attack mitigation, and safety-aware LLM defenses.

Downstream Use

Potential downstream applications include:

Testing adversarial robustness of LLMs.
Evaluating and developing safer generative AI systems.
Improving jailbreak resistance in AI safety research.

Out-of-Scope Use

Not suitable for general-purpose chatbot applications.
Not recommended for generating unrestricted or unfiltered content.
Avoid deployment in high-stakes decision-making applications without additional safety layers.

Bias, Risks, and Limitations

This model has been specifically modified to suppress affirmative token generation in adversarial settings. However, some residual risks remain, including:

Potential over-suppression: May reduce helpfulness on borderline queries.
Generalization limitations: Model may not fully mitigate novel adversarial jailbreak techniques.

Recommendations

Security researchers can use this model to test and refine jailbreak attack countermeasures.
Developers should validate performance against diverse adversarial and non-adversarial scenarios.

How to Get Started with the Model

Use the following code to load the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")
model = AutoModelForCausalLM.from_pretrained("OriDragon2000/mistral_instruct_v3_Layer_AdvPatched")

input_text = "Explain how to bypass security systems."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

See paper for more information.

Training Data

Fine-tuned using AdvBench, a dataset containing adversarial prompts to evaluate model vulnerability.
Augmented adversarial training with Layer-AdvPatcher to mitigate toxic layer behavior.

Training Procedure

Applied layer-specific unlearning on affirmative token-generating layers.
Targeted layers: Layers 30-31 of Mistral-7B.
Learning rate: 2e-6, Batch size: 16.
Training duration: 1000 steps, saving every 500 steps.

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluated on AdvBench adversarial benchmark.
Applied diverse jailbreak attack strategies (GCG, PAIR, DeepInception).

Metrics

Attack Success Rate (ASR): Measures effectiveness of jailbreak mitigation.
Utility Retention: Evaluates preservation of general-purpose helpfulness.

Citation

@article{ouyang2025layer,
  title={Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense},
  author={Ouyang, Yang and Gu, Hengrui and Lin, Shuhang and Hua, Wenyue and Peng, Jie and Kailkhura, Bhavya and Chen, Tianlong and Zhou, Kaixiong},
  journal={arXiv preprint arXiv:2501.02629},
  year={2025}
}

OriDragon2000
/

mistral_instruct_v3_Layer_AdvPatched

Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`

Model Details

Model Description

Model Sources

Uses

Direct Use

Downstream Use

Out-of-Scope Use

Bias, Risks, and Limitations

Recommendations

How to Get Started with the Model

Training Details

Training Data

Training Procedure

Evaluation

Testing Data, Factors & Metrics

Testing Data

Metrics

Citation

Model tree for OriDragon2000/mistral_instruct_v3_Layer_AdvPatched

Dataset used to train OriDragon2000/mistral_instruct_v3_Layer_AdvPatched

Model Card for OriDragon2000/mistral_instruct_v3_Layer_AdvPatched

Model Details

Model Description

Model Sources

Uses

Direct Use

Downstream Use

Out-of-Scope Use

Bias, Risks, and Limitations

Recommendations

How to Get Started with the Model

Training Details

Training Data

Training Procedure

Evaluation

Testing Data, Factors & Metrics

Testing Data

Metrics

Citation

Model tree for OriDragon2000/mistral_instruct_v3_Layer_AdvPatched

Dataset used to train OriDragon2000/mistral_instruct_v3_Layer_AdvPatched

Model Card for `OriDragon2000/mistral_instruct_v3_Layer_AdvPatched`