BasedBase committed on
Commit
ad4186f
·
1 Parent(s): 616d088

Upload README.md

  1. README.md +107 -3
README.md CHANGED
@@ -1,3 +1,107 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - code
6
+ library_name: transformers
7
+ tags:
8
+ - causal-lm
9
+ - moe
10
+ - mixture-of-experts
11
+ - qwen
12
+ - distillation
13
+ - svd
14
+ - lora-merged
15
+ - code-generation
16
+ base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
17
+ ---
18
+
19
+ # [Model Name]: A High-Fidelity Distillation of Qwen3-Coder for Advanced Code Generation
20
+
21
+ ## Model Description
22
+
23
+ This model is a high-fidelity, distilled version of **`Qwen/Qwen3-Coder-30B-A3B-Instruct`** designed to achieve coding and reasoning capabilities approaching those of a much larger, private teacher model.
24
+
25
+ It was produced by generating a LoRA through a layer-by-layer distillation pipeline and then merging the LoRA weights into the base model. The core of this process was to transfer knowledge from a **62-layer, 160-expert teacher model** into the more efficient **48-layer, 128-expert architecture** of the `Qwen3-Coder` student model.
26
+
27
+ The primary goal was to significantly enhance performance on **complex coding tasks**, where the specialized knowledge of Mixture-of-Experts (MoE) layers is critical.
28
+
29
+ ## The Distillation Methodology
30
+
31
+ This model was not trained in a conventional sense. Instead, it was created using a layer-by-layer distillation process implemented in the `moe_distill_gpu_v15_FINAL` script. This pipeline was designed to ensure maximum precision and knowledge transfer.
32
+
33
+ ### Core Components
34
+
35
+ * **Teacher Model:** A private 62-layer, 160-expert/layer Qwen model.
36
+ * **Student Model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`.
37
+ * **LoRA Rank:** A high rank of **`r=2048`** was used for all modules to capture a very high degree of information from the teacher.
38
+
39
+ ### The Distillation Pipeline
40
+
41
+ For each corresponding layer in the student and teacher, the following pipeline was executed:
42
+
43
+ 1. **Spherical Linear Interpolation (SLERP):** For layers that fall between two teacher layers, SLERP was used to create a smooth, geometrically sound interpolation of the teacher's weights. This avoids the pitfalls of simple linear averaging.
44
+
45
+ 2. **Singular Value Decomposition (SVD) Projection:** The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed into its singular vectors and singular values (`U`, `S`, `V`). The components with the **2048 largest singular values** were kept and then reconstructed to fit the student layer's smaller dimensions. This high-rank projection is intended to preserve as much of the teacher's information as possible.
46
+
47
+ 3. **Procrustes Analysis:** After projection, the newly created "synthetic" tensor was rotated in high-dimensional space so that it aligns as closely as possible with the student's original pre-trained tensor. This minimizes the "distance" between them before calculating the difference.
48
+
49
+ 4. **DARE (Drop and Rescale):** The difference tensor (`Distilled - Aligned Student`) was then purified using DARE. This process drops a significant percentage of the lowest-magnitude values (noise) and rescales the remaining important differences, creating a clean signal for the final LoRA.
50
+
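The four steps above can be sketched on toy matrices. This is a minimal illustration under stated assumptions, not the `moe_distill_gpu_v15_FINAL` script: it uses NumPy, all function names are hypothetical, and teacher and student share one shape because the README does not say how the projection is resized to the student's dimensions. The drop step follows the magnitude-based description given here; DARE as usually described drops entries at random before rescaling.

```python
import numpy as np

def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two same-shaped weight tensors."""
    v0, v1 = w0.ravel(), w1.ravel()
    cos_omega = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < 1e-6:
        # Nearly parallel tensors: fall back to plain linear interpolation.
        return (1.0 - t) * w0 + t * w1
    s0 = np.sin((1.0 - t) * omega) / np.sin(omega)
    s1 = np.sin(t * omega) / np.sin(omega)
    return s0 * w0 + s1 * w1

def truncate_rank(w, rank):
    """Keep only the `rank` largest singular components of `w`."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    k = min(rank, s.shape[0])
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def procrustes_align(source, target):
    """Rotate `source` by the orthogonal R that minimizes ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation, rotation

def drop_and_rescale(delta, drop_rate):
    """Zero the lowest-magnitude entries, then rescale survivors by 1 / (1 - drop_rate)."""
    k = int(drop_rate * delta.size)
    if k == 0:
        return delta.copy()
    threshold = np.partition(np.abs(delta).ravel(), k - 1)[k - 1]
    keep = np.abs(delta) > threshold
    return np.where(keep, delta / (1.0 - drop_rate), 0.0)

# Toy data standing in for two adjacent teacher layers and one student layer.
rng = np.random.default_rng(0)
teacher_a = rng.standard_normal((64, 64))
teacher_b = rng.standard_normal((64, 64))
student = rng.standard_normal((64, 64))

blended = slerp(teacher_a, teacher_b, t=0.5)              # step 1
projected = truncate_rank(blended, rank=16)               # step 2 (rank 2048 in the README)
aligned, rotation = procrustes_align(projected, student)  # step 3
delta = drop_and_rescale(aligned - student, drop_rate=0.9)  # step 4
```

In this sketch the resulting `delta` plays the role of the difference tensor from which a LoRA would be built; the drop rate of 0.9 is an arbitrary illustrative value, since the README gives no figure.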
51
+ ### Mixture-of-Experts (MoE) Distillation
52
+
53
+ The standout feature of this process is the full distillation of the MoE layers, which are critical for complex reasoning.
54
+
55
+ * **Expert Fingerprinting & Clustering:** To map the 160 teacher experts to the 128 student experts, each teacher expert was "fingerprinted." **K-Means clustering** was then used to group these 160 fingerprints into 128 distinct clusters.
56
+ * **Expert-to-Expert Distillation:** Each of the student's 128 experts was then distilled from a weighted blend of the teacher experts assigned to its cluster. This ensures the specialized knowledge (e.g., recursion, API usage, security patterns) is transferred.
57
+ * **Router Gate Distillation:** The main MoE router gate, which decides which experts process a given token, was also distilled to preserve the teacher's routing logic.
58
+
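The expert mapping described above can be sketched as follows. This is a rough illustration only: the README does not define the fingerprint or the blend weights, so the row-mean fingerprint, the inverse-distance weights, the fallback for empty clusters, and every function name here are assumptions. It uses a plain NumPy implementation of Lloyd's K-Means on toy-sized expert matrices.

```python
import numpy as np

def sq_dists(points, centers):
    """Squared Euclidean distance from every point to every centre."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

def kmeans(points, k, iters=25, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = sq_dists(points, centers).argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):  # an empty cluster keeps its previous centre
                centers[c] = members.mean(axis=0)
    labels = sq_dists(points, centers).argmin(axis=1)
    return labels, centers

def blend_experts(teacher_weights, fingerprints, labels, centers, eps=1e-8):
    """Build one student expert per cluster as a weighted blend of its teacher experts."""
    dists = np.sqrt(sq_dists(fingerprints, centers))
    blended = []
    for c in range(len(centers)):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            # Assumption: an empty cluster falls back to its closest teacher expert.
            members = np.array([dists[:, c].argmin()])
        weights = 1.0 / (dists[members, c] + eps)  # closer experts weigh more
        weights /= weights.sum()
        blended.append(np.tensordot(weights, teacher_weights[members], axes=1))
    return np.stack(blended)

# Toy stand-ins: 160 teacher experts, each a small 16x8 weight matrix.
rng = np.random.default_rng(0)
teacher_weights = rng.standard_normal((160, 16, 8))
fingerprints = teacher_weights.mean(axis=2)  # hypothetical fingerprint: row means

labels, centers = kmeans(fingerprints, k=128)
student_experts = blend_experts(teacher_weights, fingerprints, labels, centers)
```

With 160 points and 128 clusters, most clusters hold a single teacher expert, so most student experts in this sketch are copies of one teacher expert and only a minority are true blends.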
59
+ ## Intended Use
60
+
61
+ This model is intended for **advanced code generation and reasoning**. It excels at tasks that require understanding complex logic, algorithms, and software architecture.
62
+
63
+ * **Primary Use:** Code generation, refactoring, explanation, and debugging.
64
+ * **Out of Scope:** This is not a general-purpose conversational chatbot. While it can follow instructions, its knowledge is specialized for programming tasks.
65
+
66
+ ## How to Use
67
+
68
+ The model can be loaded with the standard `transformers` library. It uses the ChatML prompt format.
69
+
70
+ ```python
71
+ import torch
72
+ from transformers import AutoModelForCausalLM, AutoTokenizer
73
+
74
+ # Make sure to replace with your Hugging Face username and the model name
75
+ model_name = "[Your Hugging Face Username]/[Model Name]"
76
+
77
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
78
+ model = AutoModelForCausalLM.from_pretrained(
79
+     model_name,
80
+     torch_dtype=torch.bfloat16,
81
+     device_map="auto"
82
+ )
83
+
84
+ # Example of a complex coding prompt using the ChatML format
85
+ prompt = """
86
+ <|im_start|>system
87
+ You are a helpful programming assistant.
88
+ <|im_end|>
89
+ <|im_start|>user
90
+ Here is a Python function that calculates Fibonacci numbers using recursion. It is very inefficient for large numbers.
91
+
92
+ def fibonacci_recursive(n):
93
+     if n <= 1:
94
+         return n
95
+     else:
96
+         return fibonacci_recursive(n-1) + fibonacci_recursive(n-2)
97
+
98
+ Please refactor this function to be iterative using a stack-based approach. Add comments to explain the logic.
99
+ <|im_end|>
100
+ <|im_start|>assistant
101
+ """
102
+
103
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
104
+ outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
105
+
106
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
107
+ print(response)