RLAIF/dpo_thinking_openorca_offtheshelf_improved_1e-6_0.02_1.7B_8B
Updated
RLAIF/dpo_thinking_openorca_offtheshelf_improved_1e-6_0.02_1.7B_1.7B
Updated
RLAIF/dpo_answer_openorca_offtheshelf_improved_1e-6_0.02_1.7B_4B
Updated
RLAIF/dpo_answer_openorca_offtheshelf_improved_1e-6_0.02_1.7B_1.7B
Updated
RLAIF/dpo_answer_openorca_offtheshelf_improved_1e-6_0.02_1.7B_0.6B
Updated
RLAIF/dpo_thinking_base_openorca_0.02_1.7B-4B
Updated
RLAIF/grpo_thinking_ultrafeedback-original_32_64_4_3e-3_2e-7_step-120_1.7B
2B • Updated • 4
2B • Updated • 2
2B • Updated • 4
RLAIF/grpo_5e-7_4_1.7B-best
2B • Updated • 6
RLAIF/Qwen3-1.7B_grpo_lr2e-7_n4_step30
2B • Updated • 4
0.8B • Updated • 6
RLAIF/llama-3b-open-r1-50k-sft
4B • Updated • 5
Text Generation • 8B • Updated • 4
RLAIF/sft-llama-3.1-8b-external
Text Generation • 8B • Updated • 1
RLAIF/sft-gemma-2-9b-base-sft-llama-405b-instruct-correct-only-format-lr-5e-06-bs-64
Text Generation • 9B • Updated
RLAIF/sft-llama8b-prm-800k-correct-only
Text Generation • 8B • Updated
RLAIF/22-sequential-temp-0-verifier-no-best-oracle-in-context-train-8
8B • Updated
RLAIF/22-sequential-temp-0-verifier-oracle-in-context-train-8-w-error-masking
8B • Updated
RLAIF/15-w-error-masking-temp-0-verifier-in-context-train-in-context-inference-8-model
8B • Updated • 4