Heralax/llama-gRPo-thoughtprocess

Heralax committed on Jun 7

Commit f39ee02 · verified · 1 Parent(s): a210d0d

Update README.md

Files changed (1)

README.md +2 -0

@@ -24,4 +24,6 @@ Using the hardcoded system prompt prefix is heavily encouraged.
 Typical min P settings seem to work alright, though repetition is observed with some sampling parameters; be careful and experiment a bit.
+Fundamentally, this is an experimental method applied to a slightly continually trained Mistral 7B v0.2; because its base is dated, it might lack some of the raw intelligence of newer models.
 Try using [Augmentoolkit's](https://github.com/e-p-armstrong/augmentoolkit) GRPO pipeline to do RL on your own RP models! No code changes required, just use a prompt that grades responses you like highly.