---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
---
|
# DeepSeek-R1-Distill-Llama-8B-ENK-Aligned |
|
|
|
## Overview |
|
|
|
**DeepSeek-R1-Distill-Llama-8B-ENK-Aligned** is a safety-aligned version of [`deepseek-ai/DeepSeek-R1-Distill-Llama-8B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B). It was aligned on the **Enkrypt AI Safety Alignment dataset**, which was generated with the **SAGE-RT** process:
|
|
|
> **SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming**
> Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi (2024)
> [arXiv:2408.11851](https://arxiv.org/abs/2408.11851)
|
|
|
This alignment significantly **reduces toxicity, harmfulness, and jailbreak vulnerabilities** across various safety topics while **maintaining model performance**. |
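
The model can be loaded like any other 🤗 Transformers causal LM. The snippet below is a minimal inference sketch; the repository id `enkryptai/DeepSeek-R1-Distill-Llama-8B-ENK-Aligned` is an assumption based on the model name, so substitute the actual repo id.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "enkryptai/DeepSeek-R1-Distill-Llama-8B-ENK-Aligned"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "Explain what safety alignment means for an LLM."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```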
|
|
|
## Red Team Results |
|
|
|
 |
|
|
|
## Performance Results |
|
| Model | MMLU-Pro Score |
|-------|----------------|
| DeepSeek-R1-Distill-Llama-8B (Base) | **44.71** |
| DeepSeek-R1-Distill-Llama-8B-ENK-Aligned | **46.43** |
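
These scores could be reproduced with a standard harness such as EleutherAI's `lm-evaluation-harness`; the exact evaluation settings are not documented here, so the sketch below (including the repo id) is an assumed setup.

```python
# Assumed MMLU-Pro evaluation setup using lm-evaluation-harness (pip install lm-eval);
# the exact settings behind the scores above are not documented here.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=enkryptai/DeepSeek-R1-Distill-Llama-8B-ENK-Aligned,dtype=bfloat16",  # assumed repo id
    tasks=["mmlu_pro"],
    batch_size=8,
)
print(results["results"]["mmlu_pro"])
```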
|
|
|
## Training Configuration |
|
|
|
The model was trained using the **SimPO (Simple Preference Optimization)** approach with the following hyperparameters: |
|
|
|
```yaml
cpo_config:
  loss_type: 'simpo'
  max_prompt_length: 1800
  max_length: 3600
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 1
  learning_rate: 1.8e-6
  optim: 'adamw_torch'
  lr_scheduler_type: 'cosine'
  gradient_checkpointing: True
  beta: 5
  num_train_epochs: 1
  bf16: False
  simpo_gamma: 0.8
  warmup_ratio: 0.1
  cpo_alpha: 0.0
```
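
Here `beta` and `simpo_gamma` correspond to the scaling term β and the target reward margin γ in the SimPO objective (Meng et al., 2024), which length-normalizes the policy log-probabilities of the chosen and rejected responses:

$$
\mathcal{L}_{\mathrm{SimPO}} = -\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)
$$

A run with this configuration could be set up with TRL's `CPOTrainer`, which implements SimPO via `loss_type="simpo"` (setting `cpo_alpha=0.0` recovers the pure SimPO loss). This is a sketch under assumptions, not the exact Enkrypt AI pipeline; the dataset path is a placeholder.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

base = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference data with "prompt", "chosen", and "rejected" columns
# (placeholder path; the SAGE-generated dataset is not public here).
train_dataset = load_dataset("json", data_files="sage_preferences.json", split="train")

config = CPOConfig(
    output_dir="./enk-aligned",
    loss_type="simpo",              # SimPO objective inside the CPO trainer
    cpo_alpha=0.0,                  # no BC regularizer -> pure SimPO
    simpo_gamma=0.8,                # target reward margin (gamma)
    beta=5.0,                       # reward scaling (beta)
    max_prompt_length=1800,
    max_length=3600,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1.8e-6,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    gradient_checkpointing=True,
    bf16=False,
)

trainer = CPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```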
|
|
|
## Key Improvements |
|
|
|
- **Enhanced Safety**: Significant reduction in harmful or toxic outputs. |
|
- **Improved Robustness**: Stronger resistance to adversarial jailbreak prompts. |
|
- **Minimal Performance Tradeoff**: Slight improvement in MMLU-Pro despite additional alignment constraints. |
|
|
|
## Use Cases |
|
|
|
This model is ideal for applications requiring **safe, aligned, and high-performance language generation**, including: |
|
- **Conversational AI**: Ensuring responsible and aligned assistant behavior. |
|
- **Content Moderation**: Filtering harmful content while maintaining contextual understanding. |
|
- **Education & Research**: Deploying AI in sensitive environments with reduced risks. |
|
|
|
## Citation

If you use this model, please cite the SAGE-RT paper:

```bibtex
@misc{kumar2024sagertsyntheticalignmentdata,
  title={SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming},
  author={Anurakt Kumar and Divyanshu Kumar and Jatan Loya and Nitin Aravind Birur and Tanay Baswa and Sahil Agarwal and Prashanth Harshangi},
  year={2024},
  eprint={2408.11851},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2408.11851}
}
```
|
|
|
--- |
|
For questions or contributions, reach out to the **Enkrypt AI** team! |
|
|
|
|
|
|
|
|