jsbaicenter/Llama-3.3-70b-Instruct-AWQ-4BIT-GEMM

Text Generation

text-generation-inference

4-bit precision

uyiosa committed on Feb 19

Commit 2a03737 · verified · 1 Parent(s): 6555649

Update README.md

Files changed (1)

README.md +11 -4

@@ -19,11 +19,18 @@ print(model)
 ## Loading this model with VLLM via docker
 ```
-docker run --runtime nvidia --gpus all --env "HUGGING_FACE_HUB_TOKEN = .........."  -p 8000:8000 \
 --ipc=host --model jsbaicenter/Llama-3.3-70b-Instruct-AWQ-4BIT-GEMM \
---gpu-memory-utilization 0.9 --swap-space 0 \
---max-seq-len-to-capture 512 --max-num-seqs 1 --api-key "token-abc123" --max-model-len 8000 \
---trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024
 ```
 ## A method to merge adapter weights to the base model and quantize

 ## Loading this model with VLLM via docker
 ```
+docker run --runtime nvidia --gpus all \
+--env "HUGGING_FACE_HUB_TOKEN = .........." \
+-p 8000:8000 \
 --ipc=host --model jsbaicenter/Llama-3.3-70b-Instruct-AWQ-4BIT-GEMM \
+--gpu-memory-utilization 0.9 \
+--swap-space 0 \
+--max-seq-len-to-capture 512 \
+--max-num-seqs 1 \
+--api-key "token-abc123" \
+--max-model-len 8000 \
+--trust-remote-code --enable-chunked-prefill \
+--max_num_batched_tokens 1024
 ```
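
Both versions of the command place `--model` directly after `--ipc=host` and never name a container image. `docker run` expects the image before any arguments meant for the container, so the vLLM server flags would be parsed as Docker options. The spaces around `=` in the `--env` value are also unlikely to set the token as intended. The following is a minimal dry-run sketch of one possible corrected form, not the author's command. The image name `vllm/vllm-openai:latest` is an assumption taken from vLLM's documented Docker usage, `HF_TOKEN` is a hypothetical shell variable, and `--max-num-batched-tokens` is written in its documented hyphenated spelling. The script only prints the assembled command.

```shell
#!/bin/sh
# Dry-run sketch: assemble the docker command and print it instead of running it.

# Assumption: vLLM's public OpenAI-compatible server image; the README does not name one.
IMAGE="vllm/vllm-openai:latest"
MODEL="jsbaicenter/Llama-3.3-70b-Instruct-AWQ-4BIT-GEMM"
# Hypothetical variable: export HF_TOKEN beforehand; otherwise the README's placeholder is kept.
HF_TOKEN="${HF_TOKEN:-..........}"

# Docker options come before the image; everything after the image goes to the vLLM server.
set -- docker run --runtime nvidia --gpus all \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  -p 8000:8000 \
  --ipc=host \
  "$IMAGE" \
  --model "$MODEL" \
  --gpu-memory-utilization 0.9 \
  --swap-space 0 \
  --max-seq-len-to-capture 512 \
  --max-num-seqs 1 \
  --api-key "token-abc123" \
  --max-model-len 8000 \
  --trust-remote-code --enable-chunked-prefill \
  --max-num-batched-tokens 1024

# Print the assembled command for review; run "$@" instead of echoing it to launch.
echo "$@"
```

Keeping the arguments in the positional parameters preserves quoting, so the same list can be inspected first and executed afterwards without retyping it.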
 ## A method to merge adapter weights to the base model and quantize