TinyLlama 1.1B Chat v1.0 - DeepSparse

This repo contains model files for TinyLlama 1.1B Chat optimized for DeepSparse, a CPU inference runtime for sparse models.

This model was quantized and pruned with SparseGPT, using SparseML.

Inference

Install DeepSparse LLM for fast inference on CPUs:

pip install deepsparse-nightly[llm]
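
To confirm the runtime installed correctly, a quick import check can help; this assumes the package exposes a __version__ attribute, which recent DeepSparse releases do:

# Sanity check that the runtime imports cleanly.
# Assumes deepsparse exposes __version__ (recent releases do).
import deepsparse
print(deepsparse.__version__)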

Run in a Python pipeline:

from deepsparse import TextGeneration

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

model = TextGeneration(model_path="hf:nm-testing/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds")
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)

"""


"""

Prompt template

<|im_start|>user\n
{prompt}<|im_end|>\n
<|im_start|>assistant\n
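
If you are sending more than one prompt, a small helper keeps the template in one place and lets you reuse a single pipeline load. The format_chat_prompt helper below is not part of DeepSparse; it is just a convenience wrapper around the template above, shown as a sketch:

from deepsparse import TextGeneration

def format_chat_prompt(prompt: str) -> str:
    # Wrap a user message in the chat template expected by this model.
    return f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Load the pipeline once and reuse it across prompts.
model = TextGeneration(model_path="hf:nm-testing/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds")

for question in ["How to make banana bread?", "What is model sparsity?"]:
    output = model(format_chat_prompt(question), max_new_tokens=200)
    print(output.generations[0].text)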

Sparsification

For details on how this model was sparsified, see the recipe.yaml in this repo and follow the instructions below.

git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py TinyLlama/TinyLlama-1.1B-Chat-v1.0 open_platypus --precision float16 --recipe recipe.yaml --save True
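
The obcq.py command above expects recipe.yaml to be available locally. One way to fetch it from this model repo is with huggingface_hub; this is a sketch, and the repo id is taken from the inference example above, so adjust it if you are working from a fork:

# Sketch: download recipe.yaml from the model repo so obcq.py can use it.
# Assumes huggingface_hub is installed (it is pulled in by the transformers extras).
from huggingface_hub import hf_hub_download

recipe_path = hf_hub_download(
    repo_id="nm-testing/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds",
    filename="recipe.yaml",
)
print(recipe_path)  # pass this path to the --recipe argument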

Sparse Finetuning

Continue training the sparse model to improve accuracy:

from sparseml.transformers.finetune.text_generation import run_train

# Finetune the one-shot sparsified model, distilling from the dense TinyLlama chat model.
model = "./obcq_deployment"
teacher_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
dataset_name = "open_platypus"
concatenate_data = False
output_dir = "./output_finetune"
recipe = "recipe.yaml"
num_train_epochs = 2
overwrite_output_dir = True
splits = {
    "train": "train[:50%]",  # train on the first half of the split
}

run_train(
    model_name_or_path=model,
    distill_teacher=teacher_model,
    dataset_name=dataset_name,
    output_dir=output_dir,
    recipe=recipe,
    num_train_epochs=num_train_epochs,
    overwrite_output_dir=overwrite_output_dir,
    concatenate_data=concatenate_data,
    splits=splits,
)

Export Model

Export the model while injecting the KV cache:

sparseml.export --task text-generation output_finetune/
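
Once the export finishes, you can point the DeepSparse pipeline at the local deployment instead of the Hugging Face stub. The output_finetune/deployment path below is an assumption about where sparseml.export writes its ONNX artifacts; adjust it to the directory the export actually produces:

from deepsparse import TextGeneration

# Assumption: sparseml.export wrote the ONNX deployment under output_finetune/deployment.
model = TextGeneration(model_path="./output_finetune/deployment")

formatted_prompt = "<|im_start|>user\nHow to make banana bread?<|im_end|>\n<|im_start|>assistant\n"
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)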

See our One Shot With SparseML page for a step-by-step guide to performing one-shot quantization of large language models.

Slack

For further support, and to discuss these models and AI in general, join Neural Magic's Slack Community.
