---
license: mit
language:
- en
base_model:
- allura-org/Koto-Small-7B-PT
library_name: transformers
tags:
- writing
- creative-writing
- roleplay
---

# Koto Small 7B (Instruct-Tuned)


Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was itself trained from MiMo-7B-Base on almost a billion tokens of creative-writing data. This model is meant for roleplaying and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting. A typical input looks like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
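
For reference, here is a minimal sketch of chatting with the model through transformers. The repo id below is assumed from the model name, so adjust it if it differs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Koto-Small-7B-IT"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
]

# apply_chat_template renders the ChatML transcript shown above and appends
# the trailing <|im_start|>assistant generation prompt.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```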

## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
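
With transformers' `generate()`, those settings map on like so (a sketch reusing `model`, `tokenizer`, and `input_ids` from the usage example above; `min_p` needs a reasonably recent transformers release):

```python
# Suggested samplers: temperature 1.25, min_p 0.05.
output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```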

## Datasets

```yaml
datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
```
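
Each entry is a dataset repo on the Hub, so any of them can be pulled down for a look with the `datasets` library. A quick sketch (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Peek at one dataset from the mix; "train" is the assumed split name.
ds = load_dataset("Delta-Vector/Orion-Co-Writer-51K", split="train")
print(ds[0])
```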

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Fizz for the pretrain.
- PocketDoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura!

ilya <3

## Call for Help

If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.)...

Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix

<details>

### Training Notes

As before, the model was trained for 2 epochs over the course of 12 hours on an 8xA100 DGX node, using AdEMAMix and a REX LR scheduler. Aggressive gradient clipping was used for regularization, with no weight decay (because it sucks).
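
In plain PyTorch terms, that setup looks roughly like the sketch below. It is schematic only: a toy model and AdamW stand in for the real LM and the paged 8-bit AdEMAMix used in the actual run (see the config below).

```python
import torch

# Schematic training step: aggressive gradient clipping (max_grad_norm
# 0.0001, as in the config) with weight decay set to zero.
model = torch.nn.Linear(16, 16)  # toy stand-in for the LM
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-6, weight_decay=0.0)

loss = model(torch.randn(4, 16)).pow(2).mean()  # toy loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e-4)
optimizer.step()
optimizer.zero_grad()
```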

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)



### Axolotl Config

```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true

# =============================================================================
# Dataset Configuration
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type:

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# Evaluation Settings
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# Memory Optimization
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# Multi-GPU Training
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# Logging & Monitoring
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# Training Parameters
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0

# =============================================================================
# Additional Settings
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```
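
For scale: with `micro_batch_size: 6`, `gradient_accumulation_steps: 2`, and 8 GPUs, each optimizer step sees an effective batch of 6 × 2 × 8 = 96 packed sequences of up to 16k tokens.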

</details>