Update README.md
README.md
---
license: mit
language:
- en
base_model:
- allura-org/Koto-Small-7B-PT
library_name: transformers
tags:
- writing
- creative-writing
- roleplay
---

# Koto Small 7B (Instruct-Tuned)



Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was trained on MiMo-7B-Base for almost a billion tokens of creative-writing data. This model is meant for roleplaying and instruct use cases.

## Usage

The model was trained with ChatML formatting, so a typical input would look like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
user prompt<|im_end|>
<|im_start|>assistant
```

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Fizz for the pretrain.
- Pocketdoc/Anthracite for da cool datasets.
- Hensen chat.

Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or the matrix.

Same as before, it was trained over the course of 12 hours for over 2 epochs on an 8xA100 DGX node, using AdEMAMix and a REX LR scheduler. High grad-clipping was used for regularization, with NO weight decay, because it sucks.
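
As a rough illustration of that recipe, here is a minimal PyTorch sketch, with AdamW at `weight_decay=0.0` standing in for AdEMAMix (which is not in core PyTorch) and the commonly published form of the REX decay; the step count and clip norm are illustrative values, not the ones from this run:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)  # stand-in for the real model
total_steps = 1000               # illustrative, not the actual step count

# No weight decay, per the note above; AdamW stands in for AdEMAMix.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)

def rex(step: int) -> float:
    # REX schedule: decays slowly early and steeply late, reaching 0 at the end.
    z = min(step / total_steps, 1.0)
    return (1.0 - z) / (1.0 - z / 2.0)

sched = LambdaLR(opt, lr_lambda=rex)

for step in range(total_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean()  # dummy loss
    loss.backward()
    # Clip the global gradient norm before each step for regularization.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()
    sched.step()
```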

Before training, the original model was already [converted to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code and slightly modifying the `config.json`. This opened up the usage of CCE and Liger, which let the training run go much faster than it would have otherwise.
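
Because the conversion leaves a stock Qwen 2 checkpoint, it can be sanity-checked with plain `transformers` and no `trust_remote_code`. A quick sketch against the Qwenified base linked above (that the config reports `qwen2` is my assumption about what the conversion produces):

```python
from transformers import AutoConfig, AutoModelForCausalLM

REPO = "allura-forge/MiMo-7B-Base-Qwenified"

# If the conversion worked, the config reports a plain Qwen 2 model type
# and loading needs no custom modelling code.
cfg = AutoConfig.from_pretrained(REPO)
print(cfg.model_type)  # assumed: "qwen2"

model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype="auto")
print(type(model).__name__)  # assumed: Qwen2ForCausalLM
```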

We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)