Fizzarolli committed (verified)
Commit 09e250c · 1 Parent(s): 6a0c589

Update README.md

Files changed (1):
  1. README.md +5 -8
README.md CHANGED
@@ -3,18 +3,19 @@ license: mit
  language:
  - en
  base_model:
- - allura-forge/MiMo-7B-Base-Qwenified
+ - allura-org/Koto-Small-7B-PT
  library_name: transformers
  tags:
  - writing
  - creative-writing
+ - roleplay
  ---

  # Koto Small 7B (Instruct-Tuned)

  ![482629.png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/9Bnn2AnIjfQFWBGkhDNmI.png)

- Koto-Small-7B-IT is an Instruct-tuned version of the Koto-Small-7B-PT base, Which was trained on MiMo-7B-Base for almost a billion tokens of creative-writing data. This model is meant for RP/Instruct usecases.
+ Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was trained on MiMo-7B-Base for almost a billion tokens of creative-writing data. This model is meant for roleplaying and instruct usecases.


  ## Usage
@@ -23,7 +24,7 @@ Koto-Small-7B-IT is an Instruct-tuned version of the Koto-Small-7B-PT base, Whic

  Trained with ChatML formatting, A typical input would look like this:

- ```py
+ ```
  <|im_start|>system
  system prompt<|im_end|>
  <|im_start|>user
@@ -70,7 +71,7 @@ datasets:

  ## Acknowledgements

- - Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector) & [Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
+ - Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
  - Fizz for the pretrain.
  - Pocketdoc/Anthracite for da cool datasets.
  - Hensen chat.
@@ -93,10 +94,6 @@ Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](

  Same as before, It was trained over the course of 12 hours for over 2 epochs, on an 8xA100 DGX node, Using Ademamix and REX LR schedular, High grad-clipping was used for regularization with NO WEIGHTDECAY because it sucks.

- Before training, The model was already [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This opened up the usage of CCE and Liger which let the train go much faster than it would have otherwise.
-
- We decided to keep the final model in the converted Qwen 2 format, as it is more supported by community software such as EXL2, EXL3, Aphrodite, etc, as well as the original architecture's MTP weights likely being much less effective after finetuning without them.
-
  ### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)
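
For convenience, here is a minimal, hypothetical sketch (not part of this commit) of how the ChatML format described in the updated README can be produced with the `transformers` chat template. The repo id `allura-org/Koto-Small-7B-IT` is assumed from the card's naming and is not confirmed by this diff.

```py
# Hypothetical usage sketch: format a ChatML prompt and generate a reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Koto-Small-7B-IT"  # assumed repo id, not confirmed by this commit
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "User input"},
]

# apply_chat_template renders the <|im_start|>role ... <|im_end|> structure
# shown in the README's prompt example and appends the assistant header.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

This assumes the checkpoint ships a ChatML chat template in its tokenizer config, consistent with the prompt example shown in the second hunk above.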