Lab1806 committed
Commit e1428b5 · verified · 1 parent: a9aa2e6

Update README.md

Files changed (1): README.md (+10 -9)
README.md CHANGED
@@ -9,12 +9,6 @@ library_name: transformers
---

# Welcome to FairyR1-32B created by PKU-DS-LAB!

- ## Introduction
-
- FairyR1-32B, a highly efficient large-language-model (LLM) that matches or exceeds larger models on select tasks despite using only ~5% of their parameters. Built atop the DeepSeek-R1-Distill-Qwen-32B base, FairyR1-32B leverages a novel “distill-and-merge” pipeline—combining task-focused fine-tuning with model-merging techniques to deliver competitive performance with drastically reduced size and inference cost. This project was funded by NSFC, Grant 624B2005.
-
- <!-- ## Evaluation -->
-
| Benchmark | DeepSeek-R1-671B | DeepSeek-R1-Distill-Qwen-32B | FairyR1-32B (PKU) |
| :-----------------------: | :--------------: | :--------------------------: | :-----------------------: |
| **AIME 2024 (Math)** | 79.8 | 72.6 | **80.4** |
@@ -22,9 +16,10 @@ FairyR1-32B, a highly efficient large-language-model (LLM) that matches or excee
| **LiveCodeBench (Code)** | 65.9 | 57.2 | **67.7** |
| **GPQA-Diamond (Sci-QA)** | **71.5** | 62.1 | 60.0 |

- - AIME 2024/2025 (math): We evaluate 32 times and report the average accuracy. [AIME 2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) contains 30 problems. [AIME 2025](https://huggingface.co/datasets/MathArena/aime_2025) consists of Part I and Part II, with a total of 30 questions.<br>
- - [LiveCodeBench (code)](https://huggingface.co/datasets/livecodebench/code_generation_lite): We evaluate 8 times and report the average accuracy. The dataset version is "release_v5" (date range: 2024-08-01 to 2025-02-01), consisting of 279 problems.<br>
- - [GPQA-Diamond (Sci-QA)](https://huggingface.co/datasets/Idavidrein/gpqa): We evaluate 8 times and report the average accuracy. The dataset consists of 198 problems.<br>
+ ## Introduction
+
+ FairyR1-32B, a highly efficient large-language-model (LLM) that matches or exceeds larger models on select tasks despite using only ~5% of their parameters. Built atop the DeepSeek-R1-Distill-Qwen-32B base, FairyR1-32B leverages a novel “distill-and-merge” pipeline—combining task-focused fine-tuning with model-merging techniques to deliver competitive performance with drastically reduced size and inference cost. This project was funded by NSFC, Grant 624B2005.
+

## Model Details

@@ -63,6 +58,12 @@ This work demonstrates the feasibility of significantly reducing model size and
- **Hours used(Coding):** 1.5h
- **Model Merging:** about 40min on CPU, no GPU needed.

+ ### Evaluation Set
+
+ - AIME 2024/2025 (math): We evaluate 32 times and report the average accuracy. [AIME 2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) contains 30 problems. [AIME 2025](https://huggingface.co/datasets/MathArena/aime_2025) consists of Part I and Part II, with a total of 30 questions.<br>
+ - [LiveCodeBench (code)](https://huggingface.co/datasets/livecodebench/code_generation_lite): We evaluate 8 times and report the average accuracy. The dataset version is "release_v5" (date range: 2024-08-01 to 2025-02-01), consisting of 279 problems.<br>
+ - [GPQA-Diamond (Sci-QA)](https://huggingface.co/datasets/Idavidrein/gpqa): We evaluate 8 times and report the average accuracy. The dataset consists of 198 problems.<br>
+

## FairyR1 series Team Members:
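
For context on the "Evaluation Set" bullets this commit adds: each benchmark score is an accuracy averaged over repeated full runs (32 for AIME, 8 for LiveCodeBench and GPQA-Diamond). A minimal sketch of that protocol, assuming caller-supplied generation and grading functions, since the actual evaluation harness is not part of this commit:

```python
# Sketch of "evaluate k times and report the average accuracy".
# `generate_answer` and `is_correct` are caller-supplied placeholders;
# the real FairyR1 evaluation code is not included in this commit.
from statistics import mean

def averaged_accuracy(problems, k, generate_answer, is_correct):
    """Run the full benchmark k times and average the per-run accuracy."""
    run_scores = []
    for _ in range(k):
        correct = sum(
            is_correct(generate_answer(p["question"]), p["answer"])
            for p in problems
        )
        run_scores.append(correct / len(problems))
    return mean(run_scores)

# e.g. AIME 2024: 30 problems, k=32 runs; LiveCodeBench / GPQA-Diamond: k=8
```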
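
The "Model Merging: about 40min on CPU, no GPU needed" line suggests the merge is pure tensor arithmetic over saved checkpoints. The commit does not show which merging method FairyR1-32B actually uses, so the sketch below substitutes plain equal-weight parameter averaging of two hypothetical task-specialized fine-tunes (the paths are placeholders), just to illustrate why no GPU is required:

```python
# Illustrative stand-in for a model-merging step: equal-weight parameter
# averaging of two task-specialized fine-tunes of the same base model.
# Checkpoint paths are placeholders; FairyR1's actual merging method is
# not specified in this commit.
import torch
from transformers import AutoModelForCausalLM

math_model = AutoModelForCausalLM.from_pretrained(
    "path/to/math-finetune", torch_dtype=torch.bfloat16
)  # loads on CPU by default
code_model = AutoModelForCausalLM.from_pretrained(
    "path/to/code-finetune", torch_dtype=torch.bfloat16
)

code_state = code_model.state_dict()
merged_state = {
    # Element-wise mean of floating-point weights; non-float buffers kept as-is.
    name: (param + code_state[name]) / 2 if param.is_floating_point() else param
    for name, param in math_model.state_dict().items()
}

math_model.load_state_dict(merged_state)
math_model.save_pretrained("merged-model")  # pure tensor math, CPU only
```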