---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- causal-lm
- moe
- mixture-of-experts
- qwen
- distillation
- svd
- lora-merged
- code-generation
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
---

# [Model Name]: A High-Fidelity Distillation of Qwen3-Coder for Advanced Code Generation

## Model Description

This model is a high-fidelity, distilled version of **`Qwen/Qwen3-Coder-30B-A3B-Instruct`**, designed to approach the coding and reasoning capabilities of a much larger, private teacher model.

It was produced by generating a high-rank LoRA through a custom distillation pipeline and merging those weights into the base model. The core of the process was transferring the nuanced knowledge of a **62-layer, 160-expert teacher model** into the more efficient **48-layer, 128-expert architecture** of the `Qwen3-Coder` student.
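
For context on that merge step, the snippet below shows one common way to fold a LoRA adapter into a base checkpoint with the `peft` library. It is illustrative only: the adapter path is a placeholder, and the released weights already contain the merged result.

```python
# Illustrative sketch; the published checkpoint is already merged.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "path/to/distilled-lora").merge_and_unload()
merged.save_pretrained("merged-model")
```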

The primary goal was to significantly improve performance on **complex coding tasks**, where the specialized knowledge held in the Mixture-of-Experts (MoE) layers is critical.

## The Distillation Methodology

This model was not trained in the conventional sense. Instead, it was created with a layer-by-layer distillation process implemented in the `moe_distill_gpu_v15_FINAL` script. The pipeline was designed for maximum precision and knowledge transfer.

### Core Components

* **Teacher Model:** A private 62-layer Qwen model with 160 experts per layer.
* **Student Model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`.
* **LoRA Rank:** A high rank of **`r=2048`** was used for all modules to capture as much information from the teacher as possible (see the configuration sketch after this list).
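
For reference, an `r=2048` adapter over the attention and MLP projections would correspond roughly to the `peft` configuration below. This is a sketch: the target-module list and `lora_alpha` value are assumptions, not values taken from the distillation script.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=2048,                 # unusually high rank, chosen to maximize fidelity
    lora_alpha=2048,        # assumed scaling; not documented by the original script
    target_modules=[        # assumed Qwen-style projection names
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```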

### The Distillation Pipeline

For each corresponding layer in the student and teacher, the following pipeline was executed (a condensed code sketch of these steps appears after the list):

1. **Spherical Linear Interpolation (SLERP):** For student layers that fall between two teacher layers, SLERP was used to create a smooth, geometrically sound interpolation of the teacher's weights, avoiding the pitfalls of simple linear averaging.

2. **Singular Value Decomposition (SVD) Projection:** The core of the distillation. The (potentially blended) teacher weight matrix was decomposed into its fundamental components (`U`, `S`, `V`); the **top 2048** components were selected and reconstructed to fit the student layer's smaller dimensions. This high-rank projection preserves as much of the teacher's signal as possible.

3. **Procrustes Analysis:** After projection, the newly created "synthetic" tensor was optimally rotated in high-dimensional space to align with the student's original pre-trained tensor, minimizing the distance between them before the difference was computed.

4. **DARE (Drop and Rescale):** The difference tensor (`Distilled - Aligned Student`) was then purified with DARE, which drops a large fraction of the lowest-magnitude values (noise) and rescales the remaining differences, producing a clean signal for the final LoRA.
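
To make the four steps concrete, here is a minimal PyTorch sketch of what each operation could look like. It is illustrative only: the function names, shapes, and the 0.9 drop rate are assumptions, not the actual `moe_distill_gpu_v15_FINAL` implementation.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two teacher weight tensors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    cos_omega = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm()), -1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega < 1e-4:  # nearly parallel: fall back to a linear blend
        mixed = (1.0 - t) * a + t * b
    else:
        mixed = (torch.sin((1.0 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return mixed.reshape(w_a.shape)

def svd_project(teacher_w: torch.Tensor, out_dim: int, in_dim: int, rank: int = 2048) -> torch.Tensor:
    """Truncated-SVD projection of a larger teacher matrix onto the student's shape."""
    U, S, Vh = torch.linalg.svd(teacher_w.float(), full_matrices=False)
    r = min(rank, S.shape[0])
    return (U[:out_dim, :r] * S[:r]) @ Vh[:r, :in_dim]

def procrustes_align(synthetic: torch.Tensor, student_w: torch.Tensor) -> torch.Tensor:
    """Orthogonal Procrustes: rotate the synthetic tensor toward the student tensor."""
    U, _, Vh = torch.linalg.svd(synthetic.T @ student_w.float(), full_matrices=False)
    return synthetic @ (U @ Vh)

def dare(delta: torch.Tensor, drop_rate: float = 0.9) -> torch.Tensor:
    """Drop the lowest-magnitude entries of the delta and rescale the survivors."""
    flat = delta.abs().flatten()
    threshold = flat.kthvalue(max(1, int(flat.numel() * drop_rate))).values
    return delta * (delta.abs() > threshold) / (1.0 - drop_rate)

# Schematic per-layer flow (tensor names are placeholders):
#   blended    = slerp(teacher_layer_lo, teacher_layer_hi, t)
#   projected  = svd_project(blended, *student_w.shape, rank=2048)
#   aligned    = procrustes_align(projected, student_w)
#   lora_delta = dare(aligned - student_w, drop_rate=0.9)
```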

### Mixture-of-Experts (MoE) Distillation

The standout feature of this process is the full distillation of the MoE layers, which are critical for complex reasoning. A simplified sketch of the expert-mapping step follows the list below.

* **Expert Fingerprinting & Clustering:** To map the 160 teacher experts onto the 128 student experts, each teacher expert was "fingerprinted," and **K-Means clustering** was used to group the 160 fingerprints into 128 distinct clusters.
* **Expert-to-Expert Distillation:** Each of the student's 128 experts was then distilled from a weighted blend of the teacher experts assigned to its cluster, so that specialized knowledge (e.g., recursion, API usage, security patterns) carries over.
* **Router Gate Distillation:** The MoE router gate, which decides which experts handle a given token, was also distilled to preserve the teacher's routing logic.
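
The sketch below illustrates the expert-mapping idea: fingerprint each teacher expert, cluster the fingerprints with K-Means, and blend each cluster into one student expert. The fingerprint definition and the uniform blend are assumptions; the real pipeline may weight cluster members differently before passing the blends through the SVD/Procrustes/DARE steps above.

```python
import torch
from sklearn.cluster import KMeans

def fingerprint(expert_w: torch.Tensor, size: int = 32) -> torch.Tensor:
    """Fixed-length descriptor of an expert: its leading singular values (assumed)."""
    return torch.linalg.svdvals(expert_w.float())[:size]

def map_experts(teacher_experts: list, n_student_experts: int = 128) -> list:
    """Cluster 160 teacher experts into 128 groups and blend each group."""
    prints = torch.stack([fingerprint(w) for w in teacher_experts]).numpy()
    labels = KMeans(n_clusters=n_student_experts, n_init=10, random_state=0).fit_predict(prints)

    blended = []
    for cluster in range(n_student_experts):
        members = [w for w, lbl in zip(teacher_experts, labels) if lbl == cluster]
        # Uniform average here; the actual pipeline uses a weighted blend per cluster.
        blended.append(torch.stack(members).mean(dim=0))
    return blended  # each blend is then distilled into one student expert
```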

## Intended Use

This model is intended for **advanced code generation and reasoning**. It is strongest on tasks that require understanding complex logic, algorithms, and software architecture.

* **Primary Use:** Code generation, refactoring, explanation, and debugging.
* **Out of Scope:** This is not a general-purpose conversational chatbot. It can follow instructions, but its knowledge is specialized for programming tasks.

## How to Use

The model works with the standard `transformers` generation API and uses the ChatML prompt format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your Hugging Face username and the model name
model_name = "[Your Hugging Face Username]/[Model Name]"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example of a complex coding prompt using the ChatML format
prompt = """<|im_start|>system
You are a helpful programming assistant.<|im_end|>
<|im_start|>user
Here is a Python function that calculates Fibonacci numbers using recursion. It is very inefficient for large numbers.

def fibonacci_recursive(n):
    if n <= 1:
        return n
    else:
        return fibonacci_recursive(n-1) + fibonacci_recursive(n-2)

Please refactor this function to be iterative using a stack-based approach. Add comments to explain the logic.<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens so the prompt is not echoed back
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
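
Assuming the tokenizer keeps the Qwen3-Coder chat template, the same ChatML prompt can also be built with `apply_chat_template`, which avoids writing the special tokens by hand:

```python
messages = [
    {"role": "system", "content": "You are a helpful programming assistant."},
    {"role": "user", "content": "Refactor fibonacci_recursive(n) into an iterative, stack-based version with comments."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```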