kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN

model information

A 4-bit quantization of Llama-3.1-405B-Instruct produced with AutoAWQ. The calibration data used during quantization contains both Japanese and English.
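
The quantization script itself is not reproduced in this card. As a rough orientation, the block below is a minimal sketch of the standard AutoAWQ flow under assumed settings: the group size, zero-point choice, kernel version, and the placeholder calibration texts are assumptions, not the configuration actually used for this model.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "meta-llama/Llama-3.1-405B-Instruct"
quant_path = "Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN"

# Assumed AWQ settings; the actual group size / kernel version are not documented here.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Placeholder calibration texts; the real set mixes Japanese and English prompts
# (see "calibration data" below).
calib_texts = ["日本語のサンプル文です。", "An English sample sentence."]

model = AutoAWQForCausalLM.from_pretrained(base_model, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Run AWQ calibration and write the 4-bit weights plus tokenizer files.
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)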

usage

vLLM

from vllm import LLM, SamplingParams

# Load the quantized model across 4 GPUs with tensor parallelism and AWQ dequantization.
llm = LLM(
    model="kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.97,
    quantization="awq"
)
tokenizer = llm.get_tokenizer()

# Build a chat prompt with the Llama 3.1 chat template.
messages = [
    {"role": "system", "content": "あなたは日本語で応答するAIチャットボットです。ユーザをサポートしてください。"},  # "You are an AI chatbot that responds in Japanese. Please support the user."
    {"role": "user", "content": "plotly.graph_objectsを使って散布図を作るサンプルコードを書いてください。"},  # "Write sample code that draws a scatter plot with plotly.graph_objects."
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate and print the assistant's reply.
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=1024
)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)

Please refer to this notebook for an example run on an instance equipped with four H100 (94GB) GPUs.

calibration data

ไปฅไธ‹ใฎใƒ‡ใƒผใ‚ฟใ‚ปใƒƒใƒˆใ‹ใ‚‰512ๅ€‹ใฎใƒ‡ใƒผใ‚ฟ๏ผŒใƒ—ใƒญใƒณใƒ—ใƒˆใ‚’ๆŠฝๅ‡บใ€‚1ใคใฎใƒ‡ใƒผใ‚ฟใฎใƒˆใƒผใ‚ฏใƒณๆ•ฐใฏๆœ€ๅคง350ๅˆถ้™ใ€‚
Extract 512 data points and prompts from the following dataset. The maximum token limit per data point is 350.

  • TFMC/imatrix-dataset-for-japanese-llm
  • meta-math/MetaMathQA
  • m-a-p/CodeFeedback-Filtered-Instruction
  • kunishou/databricks-dolly-15k-ja
  • ใใฎไป–ๆ—ฅๆœฌ่ชž็‰ˆใƒป่‹ฑ่ชž็‰ˆใฎwikipedia่จ˜ไบ‹ใ‹ใ‚‰ไฝœๆˆใ—ใŸใ‚ชใƒชใ‚ธใƒŠใƒซใƒ‡ใƒผใ‚ฟ๏ผŒๆœ‰ๅฎณใƒ—ใƒญใƒณใƒ—ใƒˆๅ›ž้ฟใฎใŸใ‚ใฎใ‚ชใƒชใ‚ธใƒŠใƒซใƒ‡ใƒผใ‚ฟใ‚’ไฝฟ็”จใ€‚ Original data created from Japanese and English Wikipedia articles, as well as original data for avoiding harmful prompts, is used.

License

The MIT License applies. However, please also follow the Llama 3.1 Community License Agreement that applies to the base model of this quantization.
