Lab1806 committed · Commit a9aa2e6 · verified · Parent: b08b2a7

Update README.md

Files changed (1): README.md (+24, -23)

README.md (updated)
 
13
 
14
  FairyR1-32B, a highly efficient large-language-model (LLM) that matches or exceeds larger models on select tasks despite using only ~5% of their parameters. Built atop the DeepSeek-R1-Distill-Qwen-32B base, FairyR1-32B leverages a novel “distill-and-merge” pipeline—combining task-focused fine-tuning with model-merging techniques to deliver competitive performance with drastically reduced size and inference cost. This project was funded by NSFC, Grant 624B2005.
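Since the model card's metadata declares `library_name: transformers`, the model should load with the standard `transformers` API. Below is a minimal inference sketch, assuming the repo id `PKU-DS-LAB/FairyR1-32B` (inferred from the developer and model names) and a chat template in the tokenizer; it is not the authors' published example.

```python
# Minimal inference sketch. The repo id and generation settings are
# assumptions, not the authors' published usage example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PKU-DS-LAB/FairyR1-32B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 32B parameters; bf16 roughly halves memory vs fp32
    device_map="auto",
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit long chains of thought, so leave generous headroom.
output = model.generate(input_ids, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```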
<!-- ## Evaluation -->

| Benchmark | DeepSeek-R1-671B | DeepSeek-R1-Distill-Qwen-32B | FairyR1-32B (PKU) |
| :-----------------------: | :--------------: | :--------------------------: | :---------------: |
| **AIME 2024 (Math)** | 79.8 | 72.6 | **80.4** |
| **AIME 2025 (Math)** | 70.0 | 52.9 | **75.6** |
| **LiveCodeBench (Code)** | 65.9 | 57.2 | **67.7** |
| **GPQA-Diamond (Sci-QA)** | **71.5** | 62.1 | 60.0 |

- AIME 2024/2025 (math): We run the evaluation 32 times and report the average accuracy (see the sketch after this list). [AIME 2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) contains 30 problems. [AIME 2025](https://huggingface.co/datasets/MathArena/aime_2025) consists of Part I and Part II, with a total of 30 questions.
- [LiveCodeBench (code)](https://huggingface.co/datasets/livecodebench/code_generation_lite): We run the evaluation 8 times and report the average accuracy. The dataset version is "release_v5" (date range: 2024-08-01 to 2025-02-01), consisting of 279 problems.
- [GPQA-Diamond (Sci-QA)](https://huggingface.co/datasets/Idavidrein/gpqa): We run the evaluation 8 times and report the average accuracy. The dataset consists of 198 problems.
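The repeated-run protocol above averages single-run accuracy over N independent evaluations. A minimal sketch of that bookkeeping, where `run_benchmark` is a hypothetical stand-in for one full evaluation pass:

```python
# Sketch of the "evaluate N times, report the average" protocol described
# above. `run_benchmark` is a hypothetical callable representing one full
# evaluation pass (e.g. sampling the model on all 30 AIME problems and
# scoring it), returning that run's accuracy.
from statistics import mean
from typing import Callable

def averaged_accuracy(run_benchmark: Callable[[int], float], n_runs: int) -> float:
    """Average single-run accuracy over n_runs independently seeded runs."""
    return mean(run_benchmark(seed) for seed in range(n_runs))

# AIME 2024/2025 would use n_runs=32; LiveCodeBench and GPQA-Diamond use 8.
```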
## Model Details

The FairyR1 model represents a further exploration of our earlier work [TinyR1](https://arxiv.org/pdf/2503.04872), retaining the core “Branch-Merge Distillation” approach while introducing refinements in data processing and model architecture.
On the modeling side, rather than training three separate specialists as before, we limited our scope to just two domain experts (math and code), each trained independently under identical hyperparameters (e.g., learning rate and batch size) for about five epochs. We then fused these experts into a single 32B-parameter model using the [Arcee Fusion](https://arxiv.org/pdf/2403.13257) tool. By streamlining both the data-distillation workflow and the specialist-model merging process, FairyR1 achieves task-competitive results with only a fraction of the parameters and computational cost of much larger models.
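As a mental model of the merge step, the sketch below blends two same-architecture checkpoints by linear interpolation. This is a deliberately simplified stand-in, not the actual Arcee Fusion algorithm, and the checkpoint paths are hypothetical; it only illustrates why merging is cheap: it is a single pass over the weights, not training.

```python
# Deliberately simplified merge sketch: plain linear interpolation of two
# specialist checkpoints. NOT the actual Arcee Fusion method; the paths are
# hypothetical. A merge is one pass over the weights, which is why it needs
# only CPU time (cf. the ~40 min figure under Hardware Utilization below).
import torch

def linear_merge(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * A + (1 - alpha) * B for two same-architecture state dicts."""
    assert state_a.keys() == state_b.keys(), "experts must share an architecture"
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k] for k in state_a}

math_expert = torch.load("math_expert/pytorch_model.bin", map_location="cpu")
code_expert = torch.load("code_expert/pytorch_model.bin", map_location="cpu")
torch.save(linear_merge(math_expert, code_expert), "merged_model.bin")
```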
## Result Analysis and Key Contributions:

From the test results, FairyR1 scored slightly higher than DeepSeek-R1-671B on the AIME 2025 and LiveCodeBench benchmarks, and performed comparably on AIME 2024.

These results indicate that, by building on the DeepSeek-R1-Distill-Qwen-32B base and applying targeted techniques, FairyR1 achieves comparable or slightly superior performance in mathematical and programming domains using only about 5% of the parameter count of much larger models, although performance gaps may remain in other fields such as scientific question answering.

This work demonstrates the feasibility of significantly reducing model size and potential inference cost through optimized data processing and model fusion techniques while maintaining strong task-specific performance.

## Model Description

- **Developed by:** PKU-DS-LAB
- **Model type:** Reasoning Model

### Training Data

- **Math:** 6.6k CoT trajectories from [AI-MO/NuminaMath-1.5](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5), default subset
- **Coding:** 3.8k CoT trajectories from [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), coding subset (a loading sketch follows this list)
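As referenced above, a minimal sketch of pulling the two source datasets with the `datasets` library. The `train` split names are the datasets' defaults as an assumption; the filtering that yields the 6.6k math / 3.8k coding CoT trajectories is the authors' own recipe and is not reproduced here.

```python
# Loading sketch for the two distillation sources named above. Split names
# are assumed defaults; the task-specific filtering down to 6.6k/3.8k
# trajectories is the authors' recipe and is not shown.
from datasets import load_dataset

math_source = load_dataset("AI-MO/NuminaMath-1.5", split="train")
code_source = load_dataset("open-thoughts/OpenThoughts-114k", split="train")

print(math_source)  # inspect columns before any task-specific filtering
```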

### Hardware Utilization

- **Hours used (Coding):** 1.5h
- **Model Merging:** about 40 min on CPU; no GPU needed.
## FairyR1 series Team Members: