README.md · MachoMaheen/devdock4bit at refs/pr/1

metadata

license: other
library_name: transformers
pipeline_tag: text-generation

This repository provides the codebase as presented in Autonomous Data Selection with Language Models for Mathematical Texts.

👋 Join our WeChat or NPU user group.

[ English | 中文 ]

Fine-tuning a large language model can be easy as...

https://github.com/user-attachments/assets/7c96b465-9df7-45f4-8053-bf03e58386d3

Choose your path:

Colab: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
PAI-DSW: Llama3 Example | Qwen2-VL Example
Local machine: Please refer to usage
Documentation (WIP): https://llamafactory.readthedocs.io/zh-cn/latest/

Except for the above links, all other websites are unauthorized third-party websites. Please carefully use them.

Features
Benchmark
Changelog
Supported Models
Supported Training Approaches
Provided Datasets
Requirement
Getting Started
Projects using LLaMA Factory
License
Citation
Acknowledgement

Features

Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Qwen2-VL, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
Integrated methods: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO, etc.
Scalable resources: 16-bit full-tuning, freeze-tuning, LoRA and 2/3/4/5/6/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
Advanced algorithms: GaLore, BAdam, Adam-mini, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ, PiSSA and Agent tuning.
Practical tricks: FlashAttention-2, Unsloth, Liger Kernel, RoPE scaling, NEFTune and rsLoRA.
Experiment monitors: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
Faster inference: OpenAI-style API, Gradio UI and CLI with vLLM worker.

Benchmark

Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA Factory's QLoRA further improves the efficiency regarding the GPU memory.

Definitions

Training Speed: the number of training samples processed per second during the training. (bs=4, cutoff_len=1024)
Rouge Score: Rouge-2 score on the development set of the advertising text generation task. (bs=4, cutoff_len=1024)
GPU Memory: Peak GPU memory usage in 4-bit quantized training. (bs=1, cutoff_len=1024)
We adopt pre_seq_len=128 for ChatGLM's P-Tuning and lora_rank=32 for LLaMA Factory's LoRA tuning.

Changelog

[24/10/09] We supported downloading pre-trained models and datasets from the Modelers Hub. See this tutorial for usage.

[24/09/19] We support fine-tuning the Qwen2.5 models.

[24/08/30] We support fine-tuning the Qwen2-VL models. Thank @simonJJJ's PR.

[24/08/27] We support Liger Kernel. Try enable_liger_kernel: true for efficient training.

[24/08/09] We support Adam-mini optimizer. See examples for usage. Thank @relic-yuexi's PR.

Full Changelog

[24/07/04] We supported contamination-free packed training. Use neat_packing: true to activate it. Thank @chuan298's PR.

[24/06/16] We supported PiSSA algorithm. See examples for usage.

[24/06/07] We supported fine-tuning the Qwen2 and GLM-4 models.

[24/05/26] We supported SimPO algorithm for preference learning. See examples for usage.

[24/05/20] We supported fine-tuning the PaliGemma series models. Note that the PaliGemma models are pre-trained models, you need to fine-tune them with paligemma template for chat completion.

[24/05/18] We supported KTO algorithm for preference learning. See examples for usage.

[24/05/14] We supported training and inference on the Ascend NPU devices. Check installation section for details.

[24/04/26] We supported fine-tuning the LLaVA-1.5 multimodal LLMs. See examples for usage.

[24/04/22] We provided a Colab notebook for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check Llama3-8B-Chinese-Chat and Llama3-Chinese for details.

[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples for usage.

[24/04/16] We supported BAdam optimizer. See examples for usage.

[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It achieves 117% speed and 50% memory compared with FlashAttention-2, more benchmarks can be found in this page.

[24/03/31] We supported ORPO. See examples for usage.

[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!

[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples for usage.

[24/03/13] We supported LoRA+. See examples for usage.

[24/03/07] We supported GaLore optimizer. See examples for usage.

[24/03/07] We integrated vLLM for faster and concurrent inference. Try infer_backend: vllm to enjoy 270% inference speed.

[24/02/28] We supported weight-decomposed LoRA (DoRA). Try use_dora: true to activate DoRA training.

[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples for usage.

[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.

[24/01/18] We supported agent tuning for most models, equipping model with tool using abilities by fine-tuning with dataset: glaive_toolcall_en.

[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try use_unsloth: true argument to activate unsloth patch. It achieves 170% speed in our benchmark, check this page for details.

[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.\n [23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub. See this tutorial for usage.

[23/10/21] We supported NEFTune trick for fine-tuning. Try neftune_noise_alpha: 5 argument to activate NEFTune.

[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try shift_attn: true argument to enable shift short attention.

[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See examples for usage.

[23/09/10] We supported FlashAttention-2. Try flash_attn: fa2 argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.

[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try rope_scaling: linear argument in training and rope_scaling: dynamic argument at inference to extrapolate the position embeddings.

[23/08/11] We supported DPO training for instruction-tuned models. See examples for usage.

[23/07/31] We supported dataset streaming. Try streaming: true and max_steps: 10000 arguments to load your dataset in streaming mode.

[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.

[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py to fine-tune models in your Web browser. Thank @KanadeSiina and @codemayq for their efforts in the development.

[23/07/09] We released FastEdit ⚡ 🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.

[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.

[23/06/22] We aligned the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.

[23/06/03] We supported quantized training and inference (aka QLoRA). See examples for usage.

Supported Models

Model	Model size	Template
Baichuan 2	7B/13B	baichuan2
BLOOM/BLOOMZ	560M/1.1B/1.7B/3B/7.1B/176B	-
ChatGLM3	6B	chatglm3
Command R	35B/104B	cohere
DeepSeek (Code/MoE)	7B/16B/67B/236B	deepseek
Falcon	7B/11B/40B/180B	falcon
Gemma/Gemma 2/CodeGemma	2B/7B/9B/27B	gemma
GLM-4	9B	glm4
Llama	7B/13B/33B/65B	-
Llama 2	7B/13B/70B	llama2
Llama 3	8B/70B	llama3
LLaVA-1.5	7B/13B	llava
LLaVA-NeXT	7B/8B/13B/34B/72B/110B	llava_next
LLaVA-NeXT-Video	7B/34B	llava_next_video
MiniCPM	1B/2B/4B	cpm/cpm3
Mistral/Mixtral	7B/8x7B/8x22B	mistral
OLMo	1B/7B	-
PaliGemma	3B	paligemma
Phi-1.5/Phi-2	1.3B/2.7B	-
Qwen (1-2.5) (Code/Math/MoE)	0.5B/1.5B/3B/7B/14B/32B/72B/110B	qwen
StarCoder 2	3B/7B/15B	-
XVERSE	7B/13B/65B	xverse
Yi/Yi-1.5 (Code)	1.5B/6B/9B/34B	yi
Yi-VL	6B/34B	yi_vl
Yuan 2	2B/51B/102B	yuan

For the "base" models, the template argument can be chosen from default, alpaca, vicuna etc. But make sure to use the corresponding template for the "instruct/chat" models.

Remember to use the SAME template in training and inference.

Please refer to constants.py for a full list of models we supported.

You also can add a custom chat template to template.py.

Supported Training Approaches

Approach	Full-tuning	Freeze-tuning	LoRA	QLoRA
Pre-Training	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:
Supervised Fine-Tuning	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:
Reward Modeling	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:
PPO Training	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:
DPO Training	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:
KTO Training	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:
ORPO Training	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:
SimPO Training	:white_check_mark:	:white_check_mark:	:white_check_mark:	:white_check_mark:

The implementation details of PPO can be found in this blog.

Provided Datasets

Pre-training datasets

Supervised fine-tuning datasets

MachoMaheen
/

devdock4bit

Table of Contents

Features

Benchmark

Changelog

Supported Models

Supported Training Approaches

Provided Datasets