Update README.md
README.md
---
license: mit
language:
- en
base_model:
- allura-org/Koto-Small-7B-PT
library_name: transformers
tags:
- writing
- creative-writing
- roleplay
---

# Koto Small 7B (Instruct-Tuned)



Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was trained on MiMo-7B-Base for almost a billion tokens of creative-writing data. This model is meant for roleplaying and instruct use cases.

## Usage

The model was trained with ChatML formatting, so a typical input would look like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
user prompt<|im_end|>
<|im_start|>assistant
```

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Fizz for the pretrain.
- Pocketdoc/Anthracite for da cool datasets.
- Hensen chat.

Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or the matrix.

Same as before, it was trained over the course of 12 hours for over 2 epochs on an 8xA100 DGX node, using AdEMAMix and a REX LR scheduler. High grad-clipping was used for regularization, with NO weight decay, because it sucks.
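
As a rough illustration of that recipe, here is a minimal PyTorch sketch, with AdamW at `weight_decay=0.0` standing in for AdEMAMix (which is not in core PyTorch) and the commonly published form of the REX decay; the step count and clip norm are illustrative values, not the ones from this run:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)  # stand-in for the real model
total_steps = 1000               # illustrative, not the actual step count

# No weight decay, per the note above; AdamW stands in for AdEMAMix.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)

def rex(step: int) -> float:
    # REX schedule: decays slowly early and steeply late, reaching 0 at the end.
    z = min(step / total_steps, 1.0)
    return (1.0 - z) / (1.0 - z / 2.0)

sched = LambdaLR(opt, lr_lambda=rex)

for step in range(total_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean()  # dummy loss
    loss.backward()
    # Clip the global gradient norm before each step for regularization.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()
    sched.step()
```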

Before training, the original model was already [converted to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code and slightly modifying the `config.json`. This opened up the usage of CCE and Liger, which let the training run go much faster than it would have otherwise.
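
Because the conversion leaves a stock Qwen 2 checkpoint, it can be sanity-checked with plain `transformers` and no `trust_remote_code`. A quick sketch against the Qwenified base linked above (that the config reports `qwen2` is my assumption about what the conversion produces):

```python
from transformers import AutoConfig, AutoModelForCausalLM

REPO = "allura-forge/MiMo-7B-Base-Qwenified"

# If the conversion worked, the config reports a plain Qwen 2 model type
# and loading needs no custom modelling code.
cfg = AutoConfig.from_pretrained(REPO)
print(cfg.model_type)  # assumed: "qwen2"

model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype="auto")
print(type(model).__name__)  # assumed: Qwen2ForCausalLM
```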

We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)