
Any plans for some tiny models? (<4b)

by phly95

Mistral models are great, but the lineup is unfortunately missing anything under 4B. Options like Qwen 2.5, Gemma 2 2B, and Llama 3.2 1B and 3B exist, but I feel like having a Mistral model in that range would really make deploying local LLM-powered apps a lot easier, especially when deploying to basic laptops in a workplace (good luck convincing IT to issue NVIDIA laptops to an entire company).

Were you thinking of quantizing the models so they can run on less capable hardware?

My idea was to quantize it to GGUF and run it on standard-issue company laptops with llama.cpp. I think it would be a cool way to roll out LLM-based applications within specific departments while reducing the overhead of setting up servers or managing private information.
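
For what it's worth, here's a minimal sketch of what that could look like with the llama-cpp-python bindings; the model path and quant level are placeholders, since there's no small Mistral GGUF to point at yet:

```python
# Minimal sketch: run a small GGUF-quantized model on a CPU-only laptop
# via llama-cpp-python. The model file below is a hypothetical placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/small-model-q4_k_m.gguf",  # hypothetical 4-bit GGUF
    n_ctx=4096,        # context window; keep it modest on weak hardware
    n_threads=8,       # roughly match the laptop's physical core count
    n_gpu_layers=0,    # CPU only; no NVIDIA GPU required
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You summarize meeting transcripts."},
        {"role": "user", "content": "Summarize: ..."},
    ],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```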

That sounds like a good idea. But are these larger models too big to fit in CPU memory even with quantization? I also feel like storage would be a problem if the issued laptops don't have much disk space.
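
For a rough sense of the storage side (back-of-envelope only; real GGUF files also carry embeddings, scales, and metadata, and the bits-per-weight figures below are approximations):

```python
# Back-of-envelope GGUF size estimate: parameters * bits per weight / 8.
# Actual files come out somewhat larger because of metadata and
# mixed-precision tensors.
def approx_gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (1, 3, 7):
    for bits in (4.5, 8.0):  # roughly a 4-bit and an 8-bit quant
        size = approx_gguf_size_gb(params, bits)
        print(f"{params}B @ ~{bits} bits/weight ≈ {size:.1f} GB")
```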

The problem is mainly responsiveness. The larger models can run on these laptops since they have plenty of unified memory, but the GPUs struggle once you throw a decent amount of context at them. If the goal is a complete summary of a transcript within a minute (the computer slows to a crawl during processing), then these smaller models are the only option, at least with llama.cpp's Vulkan backend.
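
If it helps make that one-minute budget concrete, here's a rough way to check it on a target laptop; the file paths, context size, and prompt are placeholders, and this just times the whole call rather than splitting prompt processing from generation:

```python
# Rough check of the one-minute summary budget on a target laptop.
import time

from llama_cpp import Llama

llm = Llama(
    model_path="models/small-model-q4_k_m.gguf",  # hypothetical GGUF file
    n_ctx=8192,      # enough room for the transcript plus the summary
    verbose=False,
)

transcript = open("transcript.txt", encoding="utf-8").read()  # placeholder input
prompt = f"Summarize the following transcript:\n\n{transcript}\n\nSummary:"

start = time.perf_counter()
out = llm(prompt, max_tokens=300)   # completion-style call
elapsed = time.perf_counter() - start

print(f"end-to-end: {elapsed:.1f}s (budget: 60s)")
print(out["choices"][0]["text"])
```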
