# TinyLlama 1.1B Chat 1.0 - DeepSparse
This repo contains model files for TinyLlama 1.1B Chat optimized for DeepSparse, a CPU inference runtime for sparse models.
This model was quantized and pruned with SparseGPT, using SparseML.
## Inference

Install DeepSparse LLM for fast inference on CPUs:

```bash
pip install deepsparse-nightly[llm]
```
Run in a Python pipeline:

```python
from deepsparse import TextGeneration

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

model = TextGeneration(model_path="hf:nm-testing/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds")
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)
```
"""
"""
## Prompt template

```
<|im_start|>user\n
{prompt}<|im_end|>\n
<|im_start|>assistant\n
```
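If you prefer to build prompts programmatically, a small helper can render chat turns into the template above. This is a minimal sketch: the `format_chat` helper and its multi-turn handling are assumptions for illustration, not part of this repo; only the `<|im_start|>`/`<|im_end|>` markers and the `TextGeneration` pipeline come from the examples above.

```python
from deepsparse import TextGeneration


def format_chat(messages):
    """Render [{"role": ..., "content": ...}] turns into the template above.

    Hypothetical helper; multi-turn handling is an assumption, not documented here.
    """
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    # Leave the assistant turn open so the model completes it.
    return prompt + "<|im_start|>assistant\n"


model = TextGeneration(model_path="hf:nm-testing/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds")
messages = [{"role": "user", "content": "How to make banana bread?"}]
print(model(format_chat(messages), max_new_tokens=200).generations[0].text)
```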
## Sparsification

For details on how this model was sparsified, see the `recipe.yaml` in this repo and follow the instructions below.
```bash
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py TinyLlama/TinyLlama-1.1B-Chat-v1.0 open_platypus --precision float16 --recipe recipe.yaml --save True
```
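If you want to drive the one-shot step from inside a larger Python script, a thin subprocess wrapper works. This is only a convenience around the exact command above (paths, model, dataset, and flags copied verbatim), not a SparseML API.

```python
import subprocess

# Run the one-shot sparsification command shown above; assumes the sparseml
# repo has been cloned into the current directory and recipe.yaml is present.
subprocess.run(
    [
        "python",
        "sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py",
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "open_platypus",
        "--precision", "float16",
        "--recipe", "recipe.yaml",
        "--save", "True",
    ],
    check=True,  # raise if the one-shot run fails
)
```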
## Sparse Finetuning
Continue training the sparse model to improve accuracy:
```python
from sparseml.transformers.finetune.text_generation import run_train

# Sparse model produced by the one-shot step above, plus its dense teacher
# for distillation
model = "./obcq_deployment"
teacher_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Dataset and training configuration
dataset_name = "open_platypus"
concatenate_data = False
output_dir = "./output_finetune"
recipe = "recipe.yaml"
num_train_epochs = 2
overwrite_output_dir = True
splits = {
    "train": "train[:50%]",
}

run_train(
    model_name_or_path=model,
    distill_teacher=teacher_model,
    dataset_name=dataset_name,
    output_dir=output_dir,
    recipe=recipe,
    num_train_epochs=num_train_epochs,
    overwrite_output_dir=overwrite_output_dir,
    concatenate_data=concatenate_data,
    splits=splits,
)
```
## Export Model

Export the model while injecting the KV cache:

```bash
sparseml.export --task text-generation output_finetune/
```
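The exported files can then be run locally with the same `TextGeneration` pipeline used above. The sketch below assumes the export writes its deployment artifacts under `output_finetune/deployment`; the exact path may differ on your setup.

```python
from deepsparse import TextGeneration

# Path assumption: sparseml.export is expected to write deployment artifacts
# under the output directory passed above; adjust if your layout differs.
model = TextGeneration(model_path="./output_finetune/deployment")

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)
```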
Follow the instructions on our One Shot With SparseML page for a step-by-step guide to performing one-shot quantization of large language models.
## Slack

For further support, and for discussion of these models and AI in general, join Neural Magic's Slack Community.