---
license: mit
language:
- en
base_model:
- allura-org/Koto-Small-7B-PT
library_name: transformers
tags:
- writing
- creative-writing
- roleplay
---

# Koto Small 7B (Instruct-Tuned)


Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was itself trained from MiMo-7B-Base on almost a billion tokens of creative-writing data. This model is meant for roleplaying and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting. A typical input looks like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
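
For reference, here is a minimal sketch of chatting with the model through transformers. The repo id below is assumed from the model name, so adjust it if it differs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Koto-Small-7B-IT"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
]

# apply_chat_template renders the ChatML transcript shown above and appends
# the trailing <|im_start|>assistant generation prompt.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```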

## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
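
With transformers' `generate()`, those settings map on like so (a sketch reusing `model`, `tokenizer`, and `input_ids` from the usage example above; `min_p` needs a reasonably recent transformers release):

```python
# Suggested samplers: temperature 1.25, min_p 0.05.
output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```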

## Datasets

```yaml
datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
```
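
Each entry is a dataset repo on the Hub, so any of them can be pulled down for a look with the `datasets` library. A quick sketch (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Peek at one dataset from the mix; "train" is the assumed split name.
ds = load_dataset("Delta-Vector/Orion-Co-Writer-51K", split="train")
print(ds[0])
```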

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Fizz for the pretrain.
- PocketDoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura!

ilya <3

## Call for Help

If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.)...

Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix

<details>

### Training Notes

As before, the model was trained for 2 epochs over the course of 12 hours on an 8xA100 DGX node, using AdEMAMix and a REX LR scheduler. Aggressive gradient clipping was used for regularization, with no weight decay (because it sucks).
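
In plain PyTorch terms, that setup looks roughly like the sketch below. It is schematic only: a toy model and AdamW stand in for the real LM and the paged 8-bit AdEMAMix used in the actual run (see the config below).

```python
import torch

# Schematic training step: aggressive gradient clipping (max_grad_norm
# 0.0001, as in the config) with weight decay set to zero.
model = torch.nn.Linear(16, 16)  # toy stand-in for the LM
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-6, weight_decay=0.0)

loss = model(torch.randn(4, 16)).pow(2).mean()  # toy loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e-4)
optimizer.step()
optimizer.zero_grad()
```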

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)



### Axolotl Config

```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true

# =============================================================================
# Dataset Configuration
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type:

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# Evaluation Settings
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# Memory Optimization
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# Multi-GPU Training
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# Logging & Monitoring
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# Training Parameters
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0

# =============================================================================
# Additional Settings
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```
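
For scale: with `micro_batch_size: 6`, `gradient_accumulation_steps: 2`, and 8 GPUs, each optimizer step sees an effective batch of 6 × 2 × 8 = 96 packed sequences of up to 16k tokens.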

</details>