---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- causal-lm
- moe
- mixture-of-experts
- qwen
- distillation
- svd
- lora-merged
- code-generation
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
---

# An SVD-Based Distillation of Qwen3-Coder-480B for Better Code Generation

## Model Description

This model is a distilled version of **`Qwen/Qwen3-Coder-30B-A3B-Instruct`**, designed to achieve coding and reasoning capabilities approaching those of a much larger teacher model.

It is the result of applying a LoRA produced by an SVD distillation pipeline and then merging those weights into the base model. The core of this process was to transfer the nuanced knowledge of a **62-layer, 160-expert teacher model** into the more efficient **48-layer, 128-expert architecture** of the `Qwen3-Coder-30B-A3B` student model.
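
The merge itself follows the standard PEFT workflow. A minimal sketch, assuming the distilled LoRA was saved as a regular PEFT adapter (paths below are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the student base model, then attach and fold in the distilled adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/distilled-lora")  # placeholder path
merged = model.merge_and_unload()  # bakes the LoRA deltas into the base weights
merged.save_pretrained("qwen3-coder-30b-svd-distilled")  # placeholder output dir
```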

The primary goal was to significantly enhance performance on **complex coding tasks**, where the specialized knowledge of Mixture-of-Experts (MoE) layers is critical.

## The Distillation Methodology

This model was not trained in the conventional sense. Instead, it was created with a layer-by-layer distillation process implemented in a custom SVD-based script, designed to maximize precision and knowledge transfer.

### Core Components

* **Teacher Model:** `Qwen/Qwen3-Coder-480B-A35B-Instruct`.
* **Student Model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`.
* **LoRA Rank:** A high rank of **`r=2048`** was used for all modules to capture as much of the teacher's weight information as possible; see the extraction sketch after this list.
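
The extraction code itself is not included in this card, but the principle is standard: a weight delta can be factored into LoRA `A`/`B` matrices with a truncated SVD. A minimal sketch (the function name and the square-root split of `S` are illustrative choices, not the author's exact code):

```python
import torch

def delta_to_lora(delta: torch.Tensor, r: int = 2048):
    """Factor a weight delta (out_dim x in_dim) into LoRA factors so delta ~ B @ A.

    Keeping the top-r singular triplets gives the best rank-r approximation of
    the delta in the Frobenius norm; at r=2048 the approximation is very close.
    """
    U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)
    sqrt_s = torch.sqrt(S[:r])
    B = U[:, :r] * sqrt_s           # (out_dim, r)
    A = sqrt_s[:, None] * Vh[:r]    # (r, in_dim)
    return A, B
```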

### The Distillation Pipeline

For each corresponding layer in the student and teacher, the following pipeline was executed (a consolidated code sketch follows the list):

1. **Spherical Linear Interpolation (SLERP):** For student layers that fall between two teacher layers, SLERP was used to create a smooth, geometrically sound interpolation of the teacher's weights. This avoids the pitfalls of simple linear averaging.

2. **Singular Value Decomposition (SVD) Projection:** The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed into its fundamental components (`U`, `S`, `V`). The **top 2048** components were kept and recomposed to fit the student layer's smaller dimensions, preserving as much of the teacher's structure as the student's shape allows.

3. **Procrustes Analysis:** After projection, the newly created "synthetic" tensor was optimally rotated in high-dimensional space to align with the student's original pre-trained tensor, minimizing the distance between the two before the difference is computed.

4. **DARE (Drop and Rescale):** The difference tensor (`Distilled - Aligned Student`) was then purified with DARE, which drops a large fraction of the lowest-magnitude values (noise) and rescales the remaining differences, yielding a clean signal for the final LoRA.
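
The original script is not reproduced here, so the following is a minimal, self-contained sketch of the four steps under stated assumptions: weights are plain 2-D tensors, `t` encodes a student layer's fractional position between two teacher layers, and the drop rate is illustrative.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Step 1: spherical interpolation between two teacher layers, treated as flat vectors."""
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    dot = torch.clamp((v0 / (v0.norm() + eps)) @ (v1 / (v1.norm() + eps)), -1.0, 1.0)
    omega = torch.arccos(dot)
    if omega < eps:  # nearly parallel: plain lerp is numerically safer
        out = (1 - t) * v0 + t * v1
    else:
        so = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return out.reshape(w0.shape)

def svd_project(teacher_w: torch.Tensor, out_dim: int, in_dim: int, r: int = 2048) -> torch.Tensor:
    """Step 2: keep the top-r singular components, then crop the factors to the
    student's shape (one simple reading of the projection described above)."""
    U, S, Vh = torch.linalg.svd(teacher_w.float(), full_matrices=False)
    r = min(r, S.shape[0])
    return U[:out_dim, :r] @ torch.diag(S[:r]) @ Vh[:r, :in_dim]

def procrustes_align(synthetic: torch.Tensor, student: torch.Tensor) -> torch.Tensor:
    """Step 3: orthogonal Procrustes. The rotation R = U @ Vh (from the SVD of
    synthetic^T @ student) best maps `synthetic` onto `student`."""
    synthetic, student = synthetic.float(), student.float()
    U, _, Vh = torch.linalg.svd(synthetic.T @ student, full_matrices=False)
    return synthetic @ (U @ Vh)

def dare(delta: torch.Tensor, drop_rate: float = 0.9) -> torch.Tensor:
    """Step 4: zero the lowest-magnitude entries and rescale the survivors
    (a magnitude-based variant of DARE's drop-and-rescale)."""
    flat = delta.flatten()
    keep = max(1, int(flat.numel() * (1.0 - drop_rate)))
    mask = torch.zeros_like(flat)
    mask[torch.topk(flat.abs(), keep).indices] = 1.0
    return (flat * mask / (1.0 - drop_rate)).reshape(delta.shape)

# One layer end to end (t, shapes, and drop_rate are illustrative):
# blended    = slerp(teacher_w_a, teacher_w_b, t=0.5)
# synthetic  = svd_project(blended, *student_w.shape, r=2048)
# aligned    = procrustes_align(synthetic, student_w)
# lora_delta = dare(aligned - student_w, drop_rate=0.9)
```

Aligning before subtracting keeps the difference tensor small and structured, which makes the DARE purification and the final LoRA factorization better behaved.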

### Mixture-of-Experts (MoE) Distillation

The standout feature of this process is the full distillation of the MoE layers, which are critical for complex reasoning.

* **Expert Fingerprinting & Clustering:** To map the 160 teacher experts onto the 128 student experts, each teacher expert was "fingerprinted." **K-Means clustering** was then used to group the 160 fingerprints into 128 distinct clusters; see the sketch after this list.
* **Expert-to-Expert Distillation:** Each of the student's 128 experts was then distilled from a weighted blend of the teacher experts assigned to its cluster, ensuring that specialized knowledge (e.g., recursion, API usage, security patterns) is transferred.
* **Router Gate Distillation:** The main MoE router gate, which decides which experts handle a given token, was also distilled to preserve the teacher's routing logic.
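
A minimal sketch of the expert-mapping step. Here `teacher_expert_weights` is a hypothetical list of 160 expert weight tensors, the fingerprint is a simple stand-in (the card does not specify the exact fingerprint features), and the blend is uniform rather than the actual weighted blend:

```python
import torch
from sklearn.cluster import KMeans

def fingerprint(expert_w: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Stand-in fingerprint: a fixed-length down-sampling of the expert's weights."""
    flat = expert_w.flatten().float()
    step = max(1, flat.numel() // dim)
    return flat[::step][:dim]

# Group the 160 teacher fingerprints into 128 clusters, one per student expert.
fps = torch.stack([fingerprint(w) for w in teacher_expert_weights]).numpy()  # (160, 64)
labels = KMeans(n_clusters=128, n_init=10, random_state=0).fit_predict(fps)

for student_idx in range(128):
    members = [w.float() for w, lab in zip(teacher_expert_weights, labels) if lab == student_idx]
    if not members:  # guard against a rare empty cluster
        continue
    blended = torch.stack(members).mean(dim=0)  # uniform blend as a stand-in
    # ...`blended` then runs through the SVD / Procrustes / DARE pipeline above...
```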

## Intended Use

This model is intended for **code generation**. It should outperform the base model on tasks that require understanding complex logic, algorithms, and software architecture.

* **Primary Use:** Code generation, refactoring, code explanation (though, as an instruct-tuned model, it may not be ideal for long-form explanations), and debugging.
* **Out of Scope:** This is not a general-purpose conversational chatbot. While it can follow instructions, its knowledge is specialized for programming tasks.
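
A minimal inference sketch with `transformers` (the repo id below is a placeholder for wherever this merged model is hosted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "your-username/qwen3-coder-30b-svd-distilled"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```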