---
license: mit
language:
- en
base_model:
- allura-org/Koto-Small-7B-PT
library_name: transformers
tags:
- writing
- creative-writing
- roleplay
---

# Koto Small 7B (Instruct-Tuned)

![482629.png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/9Bnn2AnIjfQFWBGkhDNmI.png)

Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was trained on MiMo-7B-Base for almost a billion tokens of creative-writing data. This model is meant for roleplay and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting; a typical input looks like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```

## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!

## Datasets

```yaml
datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
```

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Fizz for the pretrain.
- PocketDoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura! ilya <3

## Call for Help

If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.), please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix
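
### Example Inference Code

A minimal generation sketch tying together the ChatML template and the recommended sampler settings above. The repo id, prompts, and generation code are illustrative assumptions rather than an official quickstart, and `min_p` sampling requires a reasonably recent transformers release.

```python
# Sketch only: the repo id below is an assumption, adjust to the actual upload.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Koto-Small-7B-IT"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a creative co-writer."},
    {"role": "user", "content": "Write the opening line of a cozy mystery."},
]

# apply_chat_template renders the ChatML format shown in the Usage section.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended samplers from above: temperature 1.25, min_p 0.05.
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```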
### Training Notes

Same as before: the model was trained over the course of 12 hours for 2 epochs on an 8xA100 DGX node, using AdEMAMix and a REX LR scheduler. Aggressive gradient clipping was used for regularization, with no weight decay (because it sucks).

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66c26b6fb01b19d8c3c2467b/U-S6bC59Zg2Jhu5kPxYkj.png)

### Axolotl Config

```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true

# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type:
shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0

# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```
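
For reference, a quick back-of-the-envelope sketch of the effective batch size implied by the config above. The GPU count comes from the "8xA100 DGX node" note rather than the config itself, and with sample packing the token count per step is an upper bound rather than an exact figure.

```python
# Effective batch size implied by the Axolotl config above.
# num_gpus is taken from the training notes (8xA100), not the config (assumption).
micro_batch_size = 6
gradient_accumulation_steps = 2
num_gpus = 8
sequence_len = 16000

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = sequences_per_step * sequence_len  # upper bound with sample packing

print(f"{sequences_per_step} packed sequences/step, ~{tokens_per_step:,} tokens/step")
# -> 96 packed sequences/step, ~1,536,000 tokens/step
```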