Delta-Vector committed 6a0c589 (verified) · 1 parent: eb6309f

Create README.md
Files changed (1): README.md (+195, -0)
---
license: mit
language:
- en
base_model:
- allura-forge/MiMo-7B-Base-Qwenified
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Instruct-Tuned)

![482629.png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/9Bnn2AnIjfQFWBGkhDNmI.png)

Koto-Small-7B-IT is an instruct-tuned version of the Koto-Small-7B-PT base, which was trained on MiMo-7B-Base for almost a billion tokens of creative-writing data. This model is intended for RP and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting; a typical input looks like this:

```text
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```

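If you are using `transformers`, the tokenizer's chat template should produce this format for you. Below is a minimal sketch, assuming the tokenizer ships the ChatML template above; the repo id is a placeholder rather than the actual model path.

```python
# Minimal sketch: build a ChatML prompt with the tokenizer's chat template.
# Assumes the tokenizer carries the ChatML template shown above.
from transformers import AutoTokenizer

MODEL_ID = "allura-org/Koto-Small-7B-IT"  # placeholder -- substitute the real repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
]

# tokenize=False returns the raw ChatML string; add_generation_prompt=True
# appends the trailing <|im_start|>assistant turn for the model to complete.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```
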
## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!

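As a rough illustration of those settings with `transformers` (the repo id is again a placeholder, and `min_p` support requires a reasonably recent transformers release):

```python
# Generation sketch using the recommended samplers: temperature 1.25, min_p 0.05.
# The repo id is a placeholder; min_p requires a recent transformers version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allura-org/Koto-Small-7B-IT"  # placeholder -- substitute the real repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a creative co-writer."},
    {"role": "user", "content": "Write the opening line of a slow-burn mystery."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
    max_new_tokens=256,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
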
## Datasets

```yaml
datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
```

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector) & [Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Thanks to Fizz for the pretrain.
- Thanks to PocketDoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura!

ilya <3

## Call for Help
If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.)...

Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix
<details>

### Training Notes

As before, the model was trained for 2 epochs over the course of 12 hours on an 8xA100 DGX node, using the AdEMAMix optimizer and a REX LR scheduler. Aggressive gradient clipping was used for regularization, with no weight decay (because it sucks).

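For intuition, the sketch below shows roughly what those regularization choices look like in plain PyTorch; `AdamW` is only a stand-in for the paged 8-bit AdEMAMix optimizer the run actually used (see the axolotl config further down), and the model/batch objects are assumed.

```python
# Rough PyTorch picture of the regularization described above: very aggressive
# gradient clipping (max_grad_norm 0.0001) and zero weight decay.
# AdamW is a stand-in; the actual run used axolotl's paged_ademamix_8bit optimizer.
import torch


def make_optimizer(model: torch.nn.Module, lr: float = 8e-6) -> torch.optim.Optimizer:
    # weight_decay=0.0 mirrors the "no weight decay" choice
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)


def train_step(model: torch.nn.Module, batch: dict, optimizer: torch.optim.Optimizer) -> float:
    loss = model(**batch).loss
    loss.backward()
    # matches max_grad_norm: 0.0001 in the config below
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e-4)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```
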
Before training, the model had already been [converted to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code and slightly modifying the `config.json`. This enabled the use of Cut Cross-Entropy (CCE) and Liger kernels, which let the training run go much faster than it would have otherwise.

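A heavily simplified sketch of that kind of conversion is shown below. The source repo id and the `"mtp"` key pattern are assumptions about MiMo's checkpoint layout, and the real conversion also involved hand-editing `config.json` (architecture name and MTP-related fields), which this sketch only notes in a comment.

```python
# Simplified sketch: strip MTP weights so the checkpoint can be re-saved and then
# treated as a Qwen 2 model. The source repo id and the "mtp" key pattern are
# assumptions about MiMo's parameter naming, not a verified recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "XiaomiMiMo/MiMo-7B-Base"     # assumed source checkpoint
dst = "./MiMo-7B-Base-Qwenified"    # output directory

model = AutoModelForCausalLM.from_pretrained(src, trust_remote_code=True)
state_dict = {
    name: tensor
    for name, tensor in model.state_dict().items()
    if "mtp" not in name  # drop multi-token-prediction weights (assumed naming)
}
model.save_pretrained(dst, state_dict=state_dict)
AutoTokenizer.from_pretrained(src, trust_remote_code=True).save_pretrained(dst)

# config.json in `dst` still needs to be switched to the Qwen 2 architecture
# (e.g. "Qwen2ForCausalLM") with the MTP-specific fields removed.
```
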
We decided to keep the final model in the converted Qwen 2 format, as it is better supported by community software such as EXL2, EXL3, and Aphrodite, and because the original architecture's MTP weights would likely be much less effective after finetuning without them.

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66c26b6fb01b19d8c3c2467b/U-S6bC59Zg2Jhu5kPxYkj.png)

### Axolotl Config
```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true
# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type:

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0


# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```

</details>