You changed my mind

#4
by SerialKicked - opened

Hi,

I have to admit that you changed my mind about MoE models. I've always been of the opinion that replacing one big-ass model with a bunch of smaller ones was kind of an exercise in futility, if not outright counterproductive. I had the 'million monkeys with typewriters' analogy in mind. But I have to revise my opinion after toying with this model.

I can't speak to its general ability to pass standardized tests, and I barely touched the RP side of it. But in 1:1 conversations, this beats the best Qwen 2.5 32B fine-tunes nine times out of ten. Even with 32B models, the moment a conversation gets a bit complex, they fall back on the "repeat each user paragraph + add a vague comment" type of response (in increasingly long walls of text if you don't edit them out), something your model does very little of.

Sure, it still occasionally shits the bed, as all models do, but not nearly as much as a "normal" 8B would.

General tests:

  • Online RAG / integration of web results into a coherent response: passed.
  • Chat summary, titling, and keyword association: passed.
  • Menu navigation: passed.
  • Ability to determine when it's relevant to initiate a chat: mixed (too eager to send a message, but it respected the provided formatting).
  • Complex system prompt and message obedience: passed.

Note: it can sometimes fixate on "older" parts of the prompt instead of what was posted recently, but I think that's more of an L3 thing than something specific to your model.

Just a note: I'm not sure I get your comment about enabling Flash Attention on the model's page. During inference, FA doesn't change a model's output beyond the rare rounding error; it's just a (on average) more cost-efficient way to process the prompt / KV cache.
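
For reference, this is roughly how I enable it on my end (a minimal sketch assuming llama-cpp-python; the model path and settings are placeholders, not a recommendation):

```python
from llama_cpp import Llama

# Flash Attention only changes how attention is computed, not what the model
# outputs (beyond rounding noise), so toggling it is a speed/VRAM trade-off.
llm = Llama(
    model_path="path/to/your-moe.Q4_K_M.gguf",  # placeholder path
    n_ctx=16384,        # e.g. a 16K context window
    n_gpu_layers=-1,    # offload all layers if they fit in VRAM
    flash_attn=True,    # enable Flash Attention for prompt / KV-cache processing
)
```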

Cheers.

Thank you for your detailed feedback and notes. WOW.

Just an aside:

I have found that mastering the MoE "source" in F32 (as well as the transfer) makes a big difference in overall MoE operation.
This seems to hold true regardless of the model's architecture: Llama, Mistral, DeepSeek, and so on.

For this MoE model:
Model selection was key here too. These models are "top of their class" fine-tunes, from master fine-tune makers.
It only takes one poor model choice / design in a MoE to bring down the operation of the entire model, especially when all experts are activated.
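
To give a rough idea of what such a recipe looks like (a minimal sketch assuming mergekit's mergekit-moe tool; the base model, expert names, and prompts are placeholders, not the actual recipe for this model):

```python
import yaml

# Sketch of a mergekit-moe config: keep the merge ("source") in float32,
# and choose every expert carefully - one weak expert drags down the whole MoE.
moe_config = {
    "base_model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder base
    "gate_mode": "hidden",   # route tokens by hidden-state similarity to the prompts below
    "dtype": "float32",      # F32 source / transfer, as discussed above
    "experts": [
        {"source_model": "placeholder/expert-A-8B",
         "positive_prompts": ["creative writing", "roleplay"]},
        {"source_model": "placeholder/expert-B-8B",
         "positive_prompts": ["summaries", "titles", "keywords"]},
    ],
}

with open("moe-config.yaml", "w") as f:
    yaml.safe_dump(moe_config, f, sort_keys=False)

# Build (roughly): mergekit-moe moe-config.yaml ./output-moe
```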

Still a lot to learn... and tinker with.

It's my pleasure. Oh right, I tried it with all experts enabled (and 16K context). I picked this MoE out of your list because I'm familiar with the models used and their authors. I probably tested most of them individually back when I was limited to 12GB of VRAM.

Out of curiosity, have you made a MoE with 12B models based on Mistral Nemo, or maybe planning to? I'd be kinda interested in seeing the results.

Last little thing, you have quite a list, and there's no way I'm going through all that. 😁 Is there a MoE you're particularly proud of?

I would say the new DeepSeek MoEs I built this week are at the top of the chart at the moment, but there is a lot more tweaking yet to go here.
Likewise, the two I built are 4x8B; larger, more powerful ones are coming.

RE: MoE Nemo
Not yet. Reason: I have found that pass-through models of Nemo (Grand Gutenberg, Dark Universe) are more powerful. They are a little harder to use, but worth it.
Pass-through ("stacker") models are a lot harder to build, but when you get them right there is nothing like them.
The Grand Horror series (Llama 3) also stands out.
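
For context, a pass-through / "stacker" merge looks roughly like this (a sketch assuming mergekit's passthrough method; the donor model and layer ranges are illustrative, not a real recipe):

```python
import yaml

# Sketch of a pass-through ("stacker") merge: layer slices from donor models
# are stacked into a deeper network, rather than routed as MoE experts.
stack_config = {
    "merge_method": "passthrough",
    "dtype": "float32",
    "slices": [
        {"sources": [{"model": "mistralai/Mistral-Nemo-Instruct-2407",
                      "layer_range": [0, 30]}]},
        {"sources": [{"model": "mistralai/Mistral-Nemo-Instruct-2407",
                      "layer_range": [10, 40]}]},  # overlapping range adds depth
    ],
}

with open("stack-config.yaml", "w") as f:
    yaml.safe_dump(stack_config, f, sort_keys=False)

# Build (roughly): mergekit-yaml stack-config.yaml ./output-stack
```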

However, based on what I have learned building the DeepSeek MoEs, that might be about to change, as some of the insights from trial/error/testing could be applied to all MoEs to make them function better.

An aside, for Mistral MoEs:
https://huggingface.co/DavidAU/Mistral-MOE-4X7B-Dark-MultiVerse-Uncensored-Enhanced32-24B-gguf

This one is based on older 7B Mistrals, but it is potent due to F32 precision and augmented quants.
There are still a lot of options for pushing these MoEs to new levels.
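
To give a rough idea of the quant side (a generic sketch of an importance-matrix quant pipeline using llama.cpp tools; file names are placeholders, and this is not the exact "augmented" recipe):

```python
import subprocess

# Generic importance-matrix (imatrix) quantization with llama.cpp tools:
# measure activations on calibration text, then use them to guide
# quantization of the F32 master into a smaller GGUF.
subprocess.run(
    ["./llama-imatrix", "-m", "model-F32.gguf",
     "-f", "calibration.txt", "-o", "imatrix.dat"],
    check=True,
)
subprocess.run(
    ["./llama-quantize", "--imatrix", "imatrix.dat",
     "model-F32.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```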

I'll check your suggestions when I get more free time, thank you. Regarding R1-type models, I'm kinda waiting for the dust to settle before looking into them (plus I need to fix my UI to work correctly with them first anyway). I also played with MN-Darkest-Universe-29B for a little while. It's a very interesting model in its own right and could be fun for creative endeavors; it's just way too wild/unreliable for my day-to-day use case. (Yes, I've read your posts regarding sampling settings.)

RE: MoE Nemo [...]
However, based on what I have learned building the DeepSeek MoEs, that might be about to change, as some of the insights from trial/error/testing could be applied to all MoEs to make them function better.

Great! I'll keep an eye out! :)

Cheers.
