Question about deploying the model using containers

#1
by rthenamvar - opened

Hi, I hope everything is great for you. I'm trying to deploy this model using Docker containers. The model deploys successfully and generates output, but it is really slow compared to another model I currently have deployed (Qwen2.5-7B-Instruct-GPTQ-Int4). So I'm guessing there may be something wrong with my configuration, which is here:

--model neuralmagic/starcoder2-7b-quantized.w8a8
--disable-log-requests
--enable-prefix-caching
--use-v2-block-manager
--max-num-batched-tokens 32000
--disable-sliding-window
--block-size 32
--max-num-seqs 600
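For context, these flags are passed as the container command after the image name. A minimal sketch of the full Docker invocation, assuming the official `vllm/vllm-openai` image (the image tag, host port, and cache path here are placeholders, not taken from the original setup):

```shell
# Hypothetical deployment sketch; image tag, port mapping, and
# Hugging Face cache path are assumptions, not the original config.
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model neuralmagic/starcoder2-7b-quantized.w8a8 \
  --disable-log-requests \
  --enable-prefix-caching \
  --use-v2-block-manager \
  --max-num-batched-tokens 32000 \
  --disable-sliding-window \
  --block-size 32 \
  --max-num-seqs 600
```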

I would appreciate any kind of help.
