Heralax/llama-gRPo-thoughtprocess

Heralax committed on Jun 7

Commit 03d0612 · verified · 1 Parent(s): f39ee02

Update README.md

Files changed (1)

README.md +7 -1

@@ -1,3 +1,6 @@
 *(Pronounced "Gee RP Oh". The name is a sort-of pun because it was aligned with the GRPO algorithm, but is for RP (roleplay). Therefore, gRPo.)*
 This is an experimental proof-of-concept model trained with [Augmentoolkit's](https://github.com/e-p-armstrong/augmentoolkit) GRPO pipeline. The reinforcement learning attempted to maximize the amount of emotion in the model's writing.
@@ -26,4 +29,7 @@ Typical min P settings seem to work alright, though on some sampling params repe
 Fundamentally, this is an experimental method applied to a slightly continually trained Mistral 7b v0.2. Because its base is old, it might lack some of the raw intelligence of newer models.
-Try using [Augmentoolkit's](https://github.com/e-p-armstrong/augmentoolkit) GRPO pipeline to do RL on your own RP models! No code changes required, just use a prompt that grades responses you like highly.

+---
+license: llama3.1
+---
 *(Pronounced "Gee RP Oh". The name is a sort-of pun because it was aligned with the GRPO algorithm, but is for RP (roleplay). Therefore, gRPo.)*
 This is an experimental proof-of-concept model trained with [Augmentoolkit's](https://github.com/e-p-armstrong/augmentoolkit) GRPO pipeline. The reinforcement learning attempted to maximize the amount of emotion in the model's writing.
 Fundamentally, this is an experimental method applied to a slightly continually trained Mistral 7b v0.2. Because its base is old, it might lack some of the raw intelligence of newer models.
+Try using [Augmentoolkit's](https://github.com/e-p-armstrong/augmentoolkit) GRPO pipeline to do RL on your own RP models! No code changes required, just use a prompt that grades responses you like highly.
+Q: Why the Llama license?
+A: The DeepSeek Llama Distill model was used as the quality grader. I am not sure whether this actually means the license applies, since that model's outputs were not used directly to make this one. Still, better to be cautious.