Lab1806 committed
Commit e1428b5 · verified · 1 parent: a9aa2e6

Update README.md

Files changed (1): README.md (+10 -9)
README.md CHANGED
@@ -9,12 +9,6 @@ library_name: transformers
---

# Welcome to FairyR1-32B created by PKU-DS-LAB!

- ## Introduction
-
- FairyR1-32B, a highly efficient large-language-model (LLM) that matches or exceeds larger models on select tasks despite using only ~5% of their parameters. Built atop the DeepSeek-R1-Distill-Qwen-32B base, FairyR1-32B leverages a novel “distill-and-merge” pipeline—combining task-focused fine-tuning with model-merging techniques to deliver competitive performance with drastically reduced size and inference cost. This project was funded by NSFC, Grant 624B2005.
-
- <!-- ## Evaluation -->
-
| Benchmark | DeepSeek-R1-671B | DeepSeek-R1-Distill-Qwen-32B | FairyR1-32B (PKU) |
| :-----------------------: | :--------------: | :--------------------------: | :-----------------------: |
| **AIME 2024 (Math)** | 79.8 | 72.6 | **80.4** |
@@ -22,9 +16,10 @@ FairyR1-32B, a highly efficient large-language-model (LLM) that matches or excee
| **LiveCodeBench (Code)** | 65.9 | 57.2 | **67.7** |
| **GPQA-Diamond (Sci-QA)** | **71.5** | 62.1 | 60.0 |

- - AIME 2024/2025 (math): We evaluate 32 times and report the average accuracy. [AIME 2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) contains 30 problems. [AIME 2025](https://huggingface.co/datasets/MathArena/aime_2025) consists of Part I and Part II, with a total of 30 questions.<br>
- - [LiveCodeBench (code)](https://huggingface.co/datasets/livecodebench/code_generation_lite): We evaluate 8 times and report the average accuracy. The dataset version is "release_v5" (date range: 2024-08-01 to 2025-02-01), consisting of 279 problems.<br>
- - [GPQA-Diamond (Sci-QA)](https://huggingface.co/datasets/Idavidrein/gpqa): We evaluate 8 times and report the average accuracy. The dataset consists of 198 problems.<br>
+ ## Introduction
+
+ FairyR1-32B, a highly efficient large-language-model (LLM) that matches or exceeds larger models on select tasks despite using only ~5% of their parameters. Built atop the DeepSeek-R1-Distill-Qwen-32B base, FairyR1-32B leverages a novel “distill-and-merge” pipeline—combining task-focused fine-tuning with model-merging techniques to deliver competitive performance with drastically reduced size and inference cost. This project was funded by NSFC, Grant 624B2005.
+

## Model Details

@@ -63,6 +58,12 @@ This work demonstrates the feasibility of significantly reducing model size and
- **Hours used(Coding):** 1.5h
- **Model Merging:** about 40min on CPU, no GPU needed.

+ ### Evaluation Set
+
+ - AIME 2024/2025 (math): We evaluate 32 times and report the average accuracy. [AIME 2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) contains 30 problems. [AIME 2025](https://huggingface.co/datasets/MathArena/aime_2025) consists of Part I and Part II, with a total of 30 questions.<br>
+ - [LiveCodeBench (code)](https://huggingface.co/datasets/livecodebench/code_generation_lite): We evaluate 8 times and report the average accuracy. The dataset version is "release_v5" (date range: 2024-08-01 to 2025-02-01), consisting of 279 problems.<br>
+ - [GPQA-Diamond (Sci-QA)](https://huggingface.co/datasets/Idavidrein/gpqa): We evaluate 8 times and report the average accuracy. The dataset consists of 198 problems.<br>
+

## FairyR1 series Team Members:
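
For context on the "Evaluation Set" bullets this commit adds: each benchmark score is an accuracy averaged over repeated full runs (32 for AIME, 8 for LiveCodeBench and GPQA-Diamond). A minimal sketch of that protocol, assuming caller-supplied generation and grading functions, since the actual evaluation harness is not part of this commit:

```python
# Sketch of "evaluate k times and report the average accuracy".
# `generate_answer` and `is_correct` are caller-supplied placeholders;
# the real FairyR1 evaluation code is not included in this commit.
from statistics import mean

def averaged_accuracy(problems, k, generate_answer, is_correct):
    """Run the full benchmark k times and average the per-run accuracy."""
    run_scores = []
    for _ in range(k):
        correct = sum(
            is_correct(generate_answer(p["question"]), p["answer"])
            for p in problems
        )
        run_scores.append(correct / len(problems))
    return mean(run_scores)

# e.g. AIME 2024: 30 problems, k=32 runs; LiveCodeBench / GPQA-Diamond: k=8
```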
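
The "Model Merging: about 40min on CPU, no GPU needed" line suggests the merge is pure tensor arithmetic over saved checkpoints. The commit does not show which merging method FairyR1-32B actually uses, so the sketch below substitutes plain equal-weight parameter averaging of two hypothetical task-specialized fine-tunes (the paths are placeholders), just to illustrate why no GPU is required:

```python
# Illustrative stand-in for a model-merging step: equal-weight parameter
# averaging of two task-specialized fine-tunes of the same base model.
# Checkpoint paths are placeholders; FairyR1's actual merging method is
# not specified in this commit.
import torch
from transformers import AutoModelForCausalLM

math_model = AutoModelForCausalLM.from_pretrained(
    "path/to/math-finetune", torch_dtype=torch.bfloat16
)  # loads on CPU by default
code_model = AutoModelForCausalLM.from_pretrained(
    "path/to/code-finetune", torch_dtype=torch.bfloat16
)

code_state = code_model.state_dict()
merged_state = {
    # Element-wise mean of floating-point weights; non-float buffers kept as-is.
    name: (param + code_state[name]) / 2 if param.is_floating_point() else param
    for name, param in math_model.state_dict().items()
}

math_model.load_state_dict(merged_state)
math_model.save_pretrained("merged-model")  # pure tensor math, CPU only
```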