---
license: mit
language:
- en
base_model:
- XiaomiMiMo/MiMo-7B-Base
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Pretrained)

![482629.png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/9Bnn2AnIjfQFWBGkhDNmI.png)

Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

**Please check out [Aurore-Reveil/Koto-Small-7B-IT](https://huggingface.co/Aurore-Reveil/Koto-Small-7B-IT), it's the official RP and instruct tune!**

## Usage

This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will *not* work. Multi-turn roleplay will *not* work.

It was trained at 32k context, but since not all samples were that long, we expect ~16k of effective context in the best case.

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
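For reference, below is a minimal raw-completion sketch with Transformers using the settings above. The repo ID and prompt are placeholders (swap in this repo's actual ID), and `min_p` requires a reasonably recent Transformers release.

```python
# Minimal sketch: raw text completion, no chat template (this is a base/completion model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Aurore-Reveil/Koto-Small-7B-PT"  # placeholder; use this repo's actual ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Plain prose prompt for cowriting-style continuation.
prompt = "The lighthouse keeper had not spoken to another soul in three winters, and so"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings suggested above: temperature 1.25, min_p 0.05.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```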
## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- thank you to [unk] for drawing the art used in the model card!
- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
- thanks to curse for testing and ideas
- thanks to toasty for some data and ideas
- thanks to everyone else in allura for moral support ilya <3

## Call for Help

if you would like to help build on this model (instruct/RP SFT, further annealing on higher-quality data, etc.), please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix

### Training Notes

This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and a cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.

Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code and slightly modifying the `config.json`. This allowed us to use CCE and Liger, which made the training run much faster than it would have been otherwise.

We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/Fc-Dvakg3lSwk2co7jHIM.png)

### Finetuning Notes

This model already has the ChatML tokens added by Xiaomi. Please use this format when finetuning to ensure compatibility with the rest of the ecosystem (see the sketch after the config below).

### Axolotl Config

```yaml
## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true

## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text
shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1

lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config:
```
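As a reference for the format mentioned in the Finetuning Notes, here is a small illustrative sketch of how a single ChatML-formatted training sample might be assembled. The `to_chatml` helper and the example strings are ours, not part of any library; only the `<|im_start|>` / `<|im_end|>` tokens come from the ChatML format itself.

```python
# Illustrative only: assemble one single-turn sample in ChatML format.
def to_chatml(system: str, user: str, assistant: str) -> str:
    """Wrap a system prompt, user turn, and assistant turn in ChatML special tokens."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )

print(to_chatml(
    "You are a co-writer.",
    "Continue the story about the lighthouse keeper.",
    "The lighthouse keeper had not spoken to another soul in three winters...",
))
```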