---
tags:
- fp8
- vllm
---
Meta-Llama-3-8B-Instruct quantized to FP8 weights and activations using per-tensor quantization, ready for inference with vLLM >= 0.5.0. This checkpoint also includes experimental per-tensor scales for an FP8-quantized KV cache, enabled with the `--kv-cache-dtype fp8` argument (or `kv_cache_dtype="fp8"` in the Python API) in vLLM.
```python
from vllm import LLM

# Load the FP8 checkpoint and enable the FP8 KV cache scales shipped with it
model = LLM(model="nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV", kv_cache_dtype="fp8")

# Generate a completion and print the generated text
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
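
Since this is an instruct-tuned checkpoint, prompts generally work best when formatted with the model's chat template. A minimal sketch, assuming `transformers` is installed and the repository's tokenizer ships a chat template:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV"

# Format a chat-style prompt with the model's chat template
# (assumes the tokenizer defines one, as the base Instruct model does).
tokenizer = AutoTokenizer.from_pretrained(MODEL)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, my name is Alex. Who are you?"}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=MODEL, kv_cache_dtype="fp8")
outputs = llm.generate([prompt], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```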