NOTE: DEPRECATED. Other maintainers now provide better quantizations of this model.
LLaMA 65B converted to GGML via llama.cpp, then quantized to 4-bit.
The legacy files are for llama.cpp builds older than https://github.com/ggerganov/llama.cpp/pull/1508; the regular files are faster but do not work on those older versions.
I recommend the following settings as a good starting point:
main.exe -m ggml-LLaMa-65B-q4_0.bin -n -1 -t 32 -c 2048 --temp 0.7 --repeat_penalty 1.2 --mirostat 2 --interactive-first --color
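For reference, the key flags are: -n -1 generates until the end-of-text token, -t 32 sets the CPU thread count (tune this to your machine), -c 2048 sets the context size, --repeat_penalty 1.2 discourages repetition, --mirostat 2 enables Mirostat 2.0 sampling, and --interactive-first drops you into interactive mode before generation begins.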
Be aware that LLaMA is a text-generation model, not a conversational one, so you will have to prompt it differently than, for example, Vicuna or ChatGPT; see the example below.
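As a minimal illustration (the exact wording is up to you), rather than asking a question directly, phrase the prompt as text for the model to continue:

Below is a list of the five largest cities in Europe, with their populations:
1.

The model will then complete the list, whereas a bare question like "What are the largest cities in Europe?" may simply be continued with more questions.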