# Llama2-7b-gsm8k-pt

This repository contains model files for `llama2-7b-gsm8k-pt`, optimized for [DeepSparse](https://github.com/neuralmagic/deepsparse), a CPU inference runtime for sparse models.

This model was pruned and quantized with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

## Inference

Install DeepSparse LLM for fast inference on CPUs:

```bash
pip install deepsparse-nightly[llm]
```
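
To confirm the install resolved correctly, you can print the runtime version (a quick sanity check; `deepsparse.__version__` is assumed to follow the usual packaging convention):

```python
import deepsparse

# Prints the installed DeepSparse version, confirming the nightly build imported.
print(deepsparse.__version__)
```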

Run in a Python pipeline:

```python
from deepsparse import TextGeneration

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
# GSM8K-style prompt format the model was fine-tuned on
formatted_prompt = f"Question:{prompt}\nAnswer:"

# Downloads and compiles the sparse-quantized model from the Hugging Face Hub
model = TextGeneration(model_path="hf:nm-testing/llama2-7b-gsm8k-pt-pruned50-quant-ds")
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)
"""
First find the total distance of one sprint: 60 meters * 3 = <<60*3=180>>180 meters
Then multiply the distance of one sprint by the number of sprints per week: 180 meters/sprint * 3 sprints/week = <<180*3=540>>540 meters/week
#### 540
"""
```

## Sparsification

The final model was obtained through the following process:

- Sparsify the model to 50% using SparseML
- Fine-tune the sparse model on the GSM8K dataset
- Perform one-shot quantization of the resulting model (a sketch of the one-shot flow follows below)
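
Both compression steps are driven by SparseML recipes through its one-shot entrypoint. A minimal sketch follows, assuming SparseML's `oneshot` API and the `SparseGPTModifier` recipe schema; the recipe values, calibration dataset, and paths are illustrative placeholders, not the exact configuration used for this model:

```python
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot

# Illustrative recipe: SparseGPT pruning to 50% sparsity with quantization enabled.
# The exact recipe used for this model is not published in this card.
recipe = r"""
oneshot_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      block_size: 128
      quantize: true
      targets: ["re:model.layers.\d+$"]
"""

# Placeholder path to the GSM8K fine-tuned checkpoint from the previous step.
model = SparseAutoModelForCausalLM.from_pretrained("path/to/finetuned-llama2-7b")

# Run calibration-based one-shot compression and save the compressed model.
oneshot(
    model=model,
    dataset="gsm8k",  # calibration data; assumed to mirror the fine-tuning set
    recipe=recipe,
    output_dir="./llama2-7b-gsm8k-pruned50-quant",
)
```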