metadata
license: other
library_name: transformers
pipeline_tag: text-generation

# LLaMA Factory

This repository provides the codebase as presented in Autonomous Data Selection with Language Models for Mathematical Texts.


👋 Join our WeChat or NPU user group.

[ English | 中文 ]

Fine-tuning a large language model can be easy as...

https://github.com/user-attachments/assets/7c96b465-9df7-45f4-8053-bf03e58386d3

Choose your path:

Except for the above links, all other websites are unauthorized third-party websites. Please use them with caution.


Features

  • Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Qwen2-VL, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
  • Integrated methods: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO, etc.
  • Scalable resources: 16-bit full-tuning, freeze-tuning, LoRA and 2/3/4/5/6/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
  • Advanced algorithms: GaLore, BAdam, Adam-mini, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ, PiSSA and Agent tuning.
  • Practical tricks: FlashAttention-2, Unsloth, Liger Kernel, RoPE scaling, NEFTune and rsLoRA.
  • Experiment monitors: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
  • Faster inference: OpenAI-style API, Gradio UI and CLI with vLLM worker.

Benchmark

Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization, LLaMA Factory's QLoRA further reduces GPU memory usage.
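As a rough illustration of why a lower bit width saves GPU memory, the sketch below estimates the storage needed for the model weights alone. It is back-of-the-envelope arithmetic, not a measurement from the benchmark: the helper name `weight_memory_gib` and the 7B parameter count are assumptions chosen for the example, and the estimate ignores activations, optimizer states, LoRA adapters and quantization metadata.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB: parameters x bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1024**3


# Hypothetical 7B-parameter model, used only to show the scaling.
num_params = 7e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(num_params, bits):.2f} GiB")

# Going from 16-bit to 4-bit weights divides weight storage by four.
ratio = weight_memory_gib(num_params, 16) / weight_memory_gib(num_params, 4)
print(f"16-bit / 4-bit ratio: {ratio:.1f}")
```

Peak memory during training is higher than this figure, which is why the GPU Memory number in the benchmark is measured rather than computed.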


Definitions
  • Training Speed: the number of training samples processed per second during the training. (bs=4, cutoff_len=1024)
  • Rouge Score: Rouge-2 score on the development set of the advertising text generation task. (bs=4, cutoff_len=1024)
  • GPU Memory: Peak GPU memory usage in 4-bit quantized training. (bs=1, cutoff_len=1024)
  • We adopt pre_seq_len=128 for ChatGLM's P-Tuning and lora_rank=32 for LLaMA Factory's LoRA tuning.
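
The first two definitions can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code behind the benchmark: the function names are hypothetical, tokens are assumed to be pre-split, and Rouge-2 is computed here as a bigram-overlap F1, which may differ from the exact variant reported above.

```python
from collections import Counter


def bigrams(tokens):
    """Count adjacent token pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(reference, candidate):
    """Bigram-overlap F1 between a reference and a candidate token list."""
    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum((ref & cand).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def samples_per_second(num_samples, elapsed_seconds):
    """Training speed: samples processed per second of training."""
    return num_samples / elapsed_seconds


reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
print(rouge2_f1(reference, candidate))
print(samples_per_second(400, 100.0))
```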

Changelog

[24/10/09] We supported downloading pre-trained models and datasets from the Modelers Hub. See this tutorial for usage.

[24/09/19] We support fine-tuning the Qwen2.5 models.

[24/08/30] We support fine-tuning the Qwen2-VL models. Thanks to @simonJJJ for the PR.

[24/08/27] We support Liger Kernel. Try enable_liger_kernel: true for efficient training.

[24/08/09] We support the Adam-mini optimizer. See examples for usage. Thanks to @relic-yuexi for the PR.

Full Changelog

[24/07/04] We supported contamination-free packed training. Use neat_packing: true to activate it. Thanks to @chuan298 for the PR.

[24/06/16] We supported PiSSA algorithm. See examples for usage.

[24/06/07] We supported fine-tuning the Qwen2 and GLM-4 models.

[24/05/26] We supported SimPO algorithm for preference learning. See examples for usage.

[24/05/20] We supported fine-tuning the PaliGemma series models. Note that the PaliGemma models are pre-trained models, so you need to fine-tune them with the paligemma template for chat completion.

[24/05/18] We supported KTO algorithm for preference learning. See examples for usage.

[24/05/14] We supported training and inference on the Ascend NPU devices. Check installation section for details.

[24/04/26] We supported fine-tuning the LLaVA-1.5 multimodal LLMs. See examples for usage.

[24/04/22] We provided a Colab notebook for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face; check Llama3-8B-Chinese-Chat and Llama3-Chinese for details.

[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples for usage.

[24/04/16] We supported BAdam optimizer. See examples for usage.

[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It runs at 117% of the speed and uses 50% of the memory of FlashAttention-2. More benchmarks can be found in this page.

[24/03/31] We supported ORPO. See examples for usage.

[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!

[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples for usage.

[24/03/13] We supported LoRA+. See examples for usage.

[24/03/07] We supported GaLore optimizer. See examples for usage.

[24/03/07] We integrated vLLM for faster and concurrent inference. Try infer_backend: vllm to enjoy 270% inference speed.

[24/02/28] We supported weight-decomposed LoRA (DoRA). Try use_dora: true to activate DoRA training.

[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples for usage.

[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.

[24/01/18] We supported agent tuning for most models, equipping them with tool-using abilities by fine-tuning with dataset: glaive_toolcall_en.

[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try the use_unsloth: true argument to activate the unsloth patch. It achieves 170% speed in our benchmark; check this page for details.

[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.

[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub. See this tutorial for usage.

[23/10/21] We supported NEFTune trick for fine-tuning. Try neftune_noise_alpha: 5 argument to activate NEFTune.

[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try shift_attn: true argument to enable shift short attention.

[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See examples for usage.

[23/09/10] We supported FlashAttention-2. Try flash_attn: fa2 argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.

[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try rope_scaling: linear argument in training and rope_scaling: dynamic argument at inference to extrapolate the position embeddings.

[23/08/11] We supported DPO training for instruction-tuned models. See examples for usage.

[23/07/31] We supported dataset streaming. Try streaming: true and max_steps: 10000 arguments to load your dataset in streaming mode.

[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.

[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py to fine-tune models in your Web browser. Thanks to @KanadeSiina and @codemayq for their efforts in the development.

[23/07/09] We released FastEdit ⚡ 🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.

[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.

[23/06/22] We aligned the demo API with OpenAI's format, so you can plug the fine-tuned model into arbitrary ChatGPT-based applications.

[23/06/03] We supported quantized training and inference (aka QLoRA). See examples for usage.
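
Most of the switches mentioned in the changelog are plain key-value arguments. The sketch below collects a few of them in Python dicts and renders them as `key: value` lines. The helper `to_yaml_lines` is hypothetical and handles only flat booleans, numbers and strings; whether a particular combination of options works together is an assumption that should be checked against the examples.

```python
def to_yaml_lines(config):
    """Render a flat dict as 'key: value' lines (booleans as true/false)."""
    lines = []
    for key, value in config.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return lines


# Argument names and values are taken from the changelog entries above;
# combining them in one run is an assumption made for illustration.
train_options = {
    "flash_attn": "fa2",
    "neftune_noise_alpha": 5,
    "rope_scaling": "linear",   # the changelog suggests linear scaling in training
    "use_dora": True,
}
infer_options = {
    "rope_scaling": "dynamic",  # and dynamic scaling at inference
}

print("\n".join(to_yaml_lines(train_options)))
print("\n".join(to_yaml_lines(infer_options)))
```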

Supported Models

| Model | Model size | Template |
| --- | --- | --- |
| Baichuan 2 | 7B/13B | baichuan2 |
| BLOOM/BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | - |
| ChatGLM3 | 6B | chatglm3 |
| Command R | 35B/104B | cohere |
| DeepSeek (Code/MoE) | 7B/16B/67B/236B | deepseek |
| Falcon | 7B/11B/40B/180B | falcon |
| Gemma/Gemma 2/CodeGemma | 2B/7B/9B/27B | gemma |
| GLM-4 | 9B | glm4 |
| Llama | 7B/13B/33B/65B | - |
| Llama 2 | 7B/13B/70B | llama2 |
| Llama 3 | 8B/70B | llama3 |
| LLaVA-1.5 | 7B/13B | llava |
| LLaVA-NeXT | 7B/8B/13B/34B/72B/110B | llava_next |
| LLaVA-NeXT-Video | 7B/34B | llava_next_video |
| MiniCPM | 1B/2B/4B | cpm/cpm3 |
| Mistral/Mixtral | 7B/8x7B/8x22B | mistral |
| OLMo | 1B/7B | - |
| PaliGemma | 3B | paligemma |
| Phi-1.5/Phi-2 | 1.3B/2.7B | - |
| Qwen (1-2.5) (Code/Math/MoE) | 0.5B/1.5B/3B/7B/14B/32B/72B/110B | qwen |
| StarCoder 2 | 3B/7B/15B | - |
| XVERSE | 7B/13B/65B | xverse |
| Yi/Yi-1.5 (Code) | 1.5B/6B/9B/34B | yi |
| Yi-VL | 6B/34B | yi_vl |
| Yuan 2 | 2B/51B/102B | yuan |

For the "base" models, the template argument can be chosen from default, alpaca, vicuna etc. But make sure to use the corresponding template for the "instruct/chat" models.

Remember to use the SAME template in training and inference.
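
One way to guard against a template mismatch is a small check before inference. The sketch below is only an illustration: the `CHAT_TEMPLATES` dict copies a few template names from the Supported Models table, while the `check_same_template` helper and the argument dicts are hypothetical and not part of LLaMA Factory.

```python
# Template names copied from the Supported Models table; only a few rows are shown.
CHAT_TEMPLATES = {
    "Llama 2": "llama2",
    "Llama 3": "llama3",
    "ChatGLM3": "chatglm3",
    "GLM-4": "glm4",
    "PaliGemma": "paligemma",
}


def check_same_template(train_args, infer_args):
    """Return the shared template name, or raise if the two settings differ."""
    train_template = train_args.get("template")
    infer_template = infer_args.get("template")
    if train_template != infer_template:
        raise ValueError(
            f"template mismatch: trained with {train_template!r}, "
            f"inferring with {infer_template!r}"
        )
    return train_template


train_args = {"template": CHAT_TEMPLATES["Llama 3"]}
infer_args = {"template": "llama3"}
print(check_same_template(train_args, infer_args))
```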

Please refer to constants.py for a full list of supported models.

You can also add a custom chat template to template.py.

Supported Training Approaches

| Approach | Full-tuning | Freeze-tuning | LoRA | QLoRA |
| --- | --- | --- | --- | --- |
| Pre-Training | ✅ | ✅ | ✅ | ✅ |
| Supervised Fine-Tuning | ✅ | ✅ | ✅ | ✅ |
| Reward Modeling | ✅ | ✅ | ✅ | ✅ |
| PPO Training | ✅ | ✅ | ✅ | ✅ |
| DPO Training | ✅ | ✅ | ✅ | ✅ |
| KTO Training | ✅ | ✅ | ✅ | ✅ |
| ORPO Training | ✅ | ✅ | ✅ | ✅ |
| SimPO Training | ✅ | ✅ | ✅ | ✅ |

The implementation details of PPO can be found in this blog.

Provided Datasets

Pre-training datasets
Supervised fine-tuning datasets