---
license: mit
language:
- en
base_model:
- allura-forge/MiMo-7B-Base-Qwenified
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Instruct-Tuned)



Koto-Small-7B-IT is an instruct-tuned version of the Koto-Small-7B-PT base model, which was trained from MiMo-7B-Base on almost a billion tokens of creative-writing data. This model is intended for RP and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting; a typical input looks like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
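
If you use the `transformers` tokenizer, you shouldn't need to build this string by hand. A minimal sketch, assuming the tokenizer ships the ChatML template above (the repo id is a placeholder; substitute the actual model path):

```py
from transformers import AutoTokenizer

# Placeholder repo id -- substitute the actual model path.
tokenizer = AutoTokenizer.from_pretrained("allura-org/Koto-Small-7B-IT")

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

# Renders the ChatML string shown above, ending with an open
# assistant turn for the model to complete.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```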

## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
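
As a rough sketch, here is how those settings map onto `transformers` sampling kwargs. This assumes a recent `transformers` release with `min_p` support, and the repo id is again a placeholder:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual model path.
model_id = "allura-org/Koto-Small-7B-IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.25,  # recommended above
    min_p=0.05,        # recommended above
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```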

## Datasets

```yaml
datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
```
|
69 |
+
|
70 |
+
|
71 |
+
## Acknowledgements
|
72 |
+
|
73 |
+
- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector) & [Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
|
74 |
+
- Fizz for the pretrain.
|
75 |
+
- Pocketdoc/Anthracite for da cool datasets.
|
76 |
+
- Hensen chat.
|
77 |
+
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
|
78 |
+
- Thanks to Curse for testing, ideas.
|
79 |
+
- Thanks to Toasty for some data, ideas.
|
80 |
+
- Thanks to everyone else in allura!
|
81 |
+
|
82 |
+
ilya <3
|
83 |
+
|
84 |
+
## Call for Help
|
85 |
+
If you would like to help build on this model (RP SFT, further annealing on higher quality data, etc)...
|
86 |
+
|
87 |
+
Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3
|
88 |
+
|

## Technical Appendix

<details>

### Training Notes

As before, the model was trained over the course of 12 hours for 2 epochs on an 8xA100 DGX node, using the AdEMAMix optimizer and a REX LR scheduler. Aggressive gradient clipping was used for regularization, with NO weight decay (because it sucks).

Before training, [the original model was converted to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code and slightly modifying the `config.json`. This opened up the use of CCE and Liger, which let the training run go much faster than it would have otherwise.
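
Conceptually, the conversion amounts to something like the sketch below. The `"mtp"` key pattern, the single-file checkpoint, and the exact `config.json` edits are assumptions for illustration, not the exact script used:

```py
import json
from safetensors.torch import load_file, save_file

# Load the original checkpoint and drop the MTP head weights.
# (The "mtp" name pattern is an assumed checkpoint layout.)
state = load_file("mimo-7b-base/model.safetensors")
state = {k: v for k, v in state.items() if "mtp" not in k}
save_file(state, "mimo-7b-base-qwenified/model.safetensors")

# Point the config at the stock Qwen 2 architecture so no custom
# modelling code is needed.
with open("mimo-7b-base/config.json") as f:
    cfg = json.load(f)
cfg["architectures"] = ["Qwen2ForCausalLM"]
cfg["model_type"] = "qwen2"
cfg.pop("auto_map", None)  # drop references to custom code, if present
with open("mimo-7b-base-qwenified/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```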

We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.
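
In practice this means the checkpoint loads with stock Qwen 2 support and no `trust_remote_code` (placeholder repo id again):

```py
from transformers import AutoConfig

# Placeholder repo id -- substitute the actual model path.
cfg = AutoConfig.from_pretrained("allura-org/Koto-Small-7B-IT")
print(cfg.model_type)  # "qwen2" -- stock architecture, no custom code needed
```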

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)



### Axolotl Config
```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true

# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type:

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0

# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```

</details>