---
license: mit
language:
- en
base_model:
- allura-forge/MiMo-7B-Base-Qwenified
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Instruct-Tuned)



Koto-Small-7B-IT is an instruct-tuned version of the Koto-Small-7B-PT base model, which was trained from MiMo-7B-Base on almost a billion tokens of creative-writing data. This model is intended for RP and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting; a typical input looks like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
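
If you use the `transformers` tokenizer, you shouldn't need to build this string by hand. A minimal sketch, assuming the tokenizer ships the ChatML template above (the repo id is a placeholder; substitute the actual model path):

```py
from transformers import AutoTokenizer

# Placeholder repo id -- substitute the actual model path.
tokenizer = AutoTokenizer.from_pretrained("allura-org/Koto-Small-7B-IT")

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

# Renders the ChatML string shown above, ending with an open
# assistant turn for the model to complete.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```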

## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
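
As a rough sketch, here is how those settings map onto `transformers` sampling kwargs. This assumes a recent `transformers` release with `min_p` support, and the repo id is again a placeholder:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual model path.
model_id = "allura-org/Koto-Small-7B-IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.25,  # recommended above
    min_p=0.05,        # recommended above
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```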

## Datasets

```yaml
datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
```
|
69 |
+
|
70 |
+
|
71 |
+
## Acknowledgements
|
72 |
+
|
73 |
+
- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector) & [Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
|
74 |
+
- Fizz for the pretrain.
|
75 |
+
- Pocketdoc/Anthracite for da cool datasets.
|
76 |
+
- Hensen chat.
|
77 |
+
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
|
78 |
+
- Thanks to Curse for testing, ideas.
|
79 |
+
- Thanks to Toasty for some data, ideas.
|
80 |
+
- Thanks to everyone else in allura!
|
81 |
+
|
82 |
+
ilya <3
|
83 |
+
|
84 |
+
## Call for Help
|
85 |
+
If you would like to help build on this model (RP SFT, further annealing on higher quality data, etc)...
|
86 |
+
|
87 |
+
Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3
|
88 |
+
|

## Technical Appendix

<details>

### Training Notes

As before, the model was trained over the course of 12 hours for 2 epochs on an 8xA100 DGX node, using the AdEMAMix optimizer and a REX LR scheduler. Aggressive gradient clipping was used for regularization, with NO weight decay (because it sucks).

Before training, [the original model was converted to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code and slightly modifying the `config.json`. This opened up the use of CCE and Liger, which let the training run go much faster than it would have otherwise.
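
Conceptually, the conversion amounts to something like the sketch below. The `"mtp"` key pattern, the single-file checkpoint, and the exact `config.json` edits are assumptions for illustration, not the exact script used:

```py
import json
from safetensors.torch import load_file, save_file

# Load the original checkpoint and drop the MTP head weights.
# (The "mtp" name pattern is an assumed checkpoint layout.)
state = load_file("mimo-7b-base/model.safetensors")
state = {k: v for k, v in state.items() if "mtp" not in k}
save_file(state, "mimo-7b-base-qwenified/model.safetensors")

# Point the config at the stock Qwen 2 architecture so no custom
# modelling code is needed.
with open("mimo-7b-base/config.json") as f:
    cfg = json.load(f)
cfg["architectures"] = ["Qwen2ForCausalLM"]
cfg["model_type"] = "qwen2"
cfg.pop("auto_map", None)  # drop references to custom code, if present
with open("mimo-7b-base-qwenified/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```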

We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.
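
In practice this means the checkpoint loads with stock Qwen 2 support and no `trust_remote_code` (placeholder repo id again):

```py
from transformers import AutoConfig

# Placeholder repo id -- substitute the actual model path.
cfg = AutoConfig.from_pretrained("allura-org/Koto-Small-7B-IT")
print(cfg.model_type)  # "qwen2" -- stock architecture, no custom code needed
```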

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)



### Axolotl Config
```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true

# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type:

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0

# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```

</details>