10k you say...
Yet the config says 4096. Should I just change it to 10240, or use scaling?
Linear RoPE scaling (8x). You shouldn't need to change the config: that value is for the original model, which was Llama-2 based and had a 4K context. It is a 32K model as the title indicates (hence the 8x scaling), as described on the original model page. The 10K comment just refers to a specific test. You should use 8x scaling even for < 32K context. There's more information on the main model page.
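If you load it directly with Hugging Face transformers rather than through a frontend, here's a rough sketch of what the 8x linear scaling looks like (the model id below is just a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id, for illustration only.
model_id = "some-org/llama2-32k-finetune"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Linear RoPE scaling: the original 4096-token position range is stretched
# by a factor of 8, giving ~32K of usable context. max_position_embeddings
# in the config can stay at 4096.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "linear", "factor": 8.0},
)
```

Most frontends expose the same factor as a linear/positional-compression setting (often labelled something like compress_pos_emb); set it to 8 there too.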
Also, this is a really old model! You should use a different one, and the 2.4bpw is pretty bad for long context in particular (sadly, anything below 4 bits is).
Ah okay, sorry, I'm not well versed in ML or computer science really! Well, your non-GPT prose sold it to me! And I was under the impression that a large model + low quant is better than a smaller model with less quantization.
I haven't used this for a while so I can't remember how it performs for RP. I've been using darkidol llama 3; it's okay, I guess. Any recommendations?
It has been many months, and that's an eternity in this field! I wouldn't recommend this model anymore; look for Llama-3 ones, maybe? Unfortunately I don't know the latest & best for RP. Perhaps the UGO leaderboard will help you.
In my testing, a large model quantized down can seem better than a smaller model, but you tend to rapidly lose coherence at larger context lengths. If that's important to you, smaller models can do better. Usually anything under 4 bits, unless very carefully tuned or calibrated, is going to exhibit significant long-context degradation. There aren't currently any non-trivial long-context leaderboards I know of, sadly (NIAH does not count), so you have to try out different models if that matters to you.
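To make the trade-off concrete, here's the rough arithmetic behind why very low-bpw quants of 70B models get used on 24 GB cards at all (a back-of-envelope sketch that ignores KV cache and runtime overhead; the helper function is just illustrative):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for the weights alone, ignoring KV cache and overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(weight_footprint_gb(70, 2.4))  # ~21 GB: a 70B model squeezes into 24 GB
print(weight_footprint_gb(70, 4.0))  # ~35 GB: the same model at 4 bpw does not fit
print(weight_footprint_gb(34, 4.0))  # ~17 GB: a 34B model fits comfortably at 4 bpw
```

So the pull toward 2.x bpw on a single consumer card is purely about fitting the weights; the cost, as above, is that long-context coherence is usually the first thing to go.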
I believe the community fine-tunes of Mistral-Large (Behemoth, Magnum, merges) can all produce text of similar quality to Aurelian, and are a lot smarter (but also bigger). They are pretty good up to about 8K of context, perhaps longer, but eventually I think the fundamental weakness of Mistral-Large at long-context starts to show.
If Llama-3 had written better prose out of the box, I'd have jumped on it much sooner with an Aurelian variant. Instead, I've been focusing on accuracy and consistency. The fine-tuned 405B is a beast for reasoning at >100K context, better than any closed-source model I've tried.
Thank you for your insights.
Yeah, I have turned to smaller models at bf16, but right now I'm trying out Yi-34B-200K-DARE-megamerge-v8-4.0bpw-h6-exl2, so neither 70B nor 2.4bpw!
Yeah, I expect the 405B would be good. I don't think I'll ever be running anything other than consumer GPUs, so I'm capped at 24 GB for now. I'll check out the leaderboard and add to my already long list of LLMs to try.