---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- causal-lm
- moe
- mixture-of-experts
- qwen
- distillation
- svd
- lora-merged
- code-generation
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
---

# An SVD-Based Distillation of Qwen3-Coder-480B for Better Code Generation

## Model Description

This model is a distilled version of **`Qwen/Qwen3-Coder-30B-A3B-Instruct`**, designed to achieve coding and reasoning capabilities approaching those of a much larger teacher model.

It is the result of applying a LoRA produced by an SVD distillation pipeline and then merging those weights into the base model. The core of this process was to transfer the nuanced knowledge of a **62-layer, 160-expert teacher model** into the more efficient **48-layer, 128-expert architecture** of the `Qwen3-Coder-30B-A3B` student model.
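
The merge itself follows the standard PEFT workflow. A minimal sketch, assuming the distilled LoRA was saved as a regular PEFT adapter (paths below are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the student base model, then attach and fold in the distilled adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/distilled-lora")  # placeholder path
merged = model.merge_and_unload()  # bakes the LoRA deltas into the base weights
merged.save_pretrained("qwen3-coder-30b-svd-distilled")  # placeholder output dir
```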

The primary goal was to significantly enhance performance on **complex coding tasks**, where the specialized knowledge of Mixture-of-Experts (MoE) layers is critical.

## The Distillation Methodology

This model was not trained in the conventional sense. Instead, it was created with a layer-by-layer distillation process implemented in a custom SVD-based script, designed to maximize precision and knowledge transfer.

### Core Components

* **Teacher Model:** `Qwen/Qwen3-Coder-480B-A35B-Instruct`.
* **Student Model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`.
* **LoRA Rank:** A high rank of **`r=2048`** was used for all modules to capture as much of the teacher's weight information as possible; see the extraction sketch after this list.
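
The extraction code itself is not included in this card, but the principle is standard: a weight delta can be factored into LoRA `A`/`B` matrices with a truncated SVD. A minimal sketch (the function name and the square-root split of `S` are illustrative choices, not the author's exact code):

```python
import torch

def delta_to_lora(delta: torch.Tensor, r: int = 2048):
    """Factor a weight delta (out_dim x in_dim) into LoRA factors so delta ~ B @ A.

    Keeping the top-r singular triplets gives the best rank-r approximation of
    the delta in the Frobenius norm; at r=2048 the approximation is very close.
    """
    U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)
    sqrt_s = torch.sqrt(S[:r])
    B = U[:, :r] * sqrt_s           # (out_dim, r)
    A = sqrt_s[:, None] * Vh[:r]    # (r, in_dim)
    return A, B
```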

### The Distillation Pipeline

For each corresponding layer in the student and teacher, the following pipeline was executed (a consolidated code sketch follows the list):

1. **Spherical Linear Interpolation (SLERP):** For student layers that fall between two teacher layers, SLERP was used to create a smooth, geometrically sound interpolation of the teacher's weights. This avoids the pitfalls of simple linear averaging.

2. **Singular Value Decomposition (SVD) Projection:** The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed into its fundamental components (`U`, `S`, `V`). The **top 2048** components were kept and recomposed to fit the student layer's smaller dimensions, preserving as much of the teacher's structure as the student's shape allows.

3. **Procrustes Analysis:** After projection, the newly created "synthetic" tensor was optimally rotated in high-dimensional space to align with the student's original pre-trained tensor, minimizing the distance between the two before the difference is computed.

4. **DARE (Drop and Rescale):** The difference tensor (`Distilled - Aligned Student`) was then purified with DARE, which drops a large fraction of the lowest-magnitude values (noise) and rescales the remaining differences, yielding a clean signal for the final LoRA.
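
The original script is not reproduced here, so the following is a minimal, self-contained sketch of the four steps under stated assumptions: weights are plain 2-D tensors, `t` encodes a student layer's fractional position between two teacher layers, and the drop rate is illustrative.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Step 1: spherical interpolation between two teacher layers, treated as flat vectors."""
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    dot = torch.clamp((v0 / (v0.norm() + eps)) @ (v1 / (v1.norm() + eps)), -1.0, 1.0)
    omega = torch.arccos(dot)
    if omega < eps:  # nearly parallel: plain lerp is numerically safer
        out = (1 - t) * v0 + t * v1
    else:
        so = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return out.reshape(w0.shape)

def svd_project(teacher_w: torch.Tensor, out_dim: int, in_dim: int, r: int = 2048) -> torch.Tensor:
    """Step 2: keep the top-r singular components, then crop the factors to the
    student's shape (one simple reading of the projection described above)."""
    U, S, Vh = torch.linalg.svd(teacher_w.float(), full_matrices=False)
    r = min(r, S.shape[0])
    return U[:out_dim, :r] @ torch.diag(S[:r]) @ Vh[:r, :in_dim]

def procrustes_align(synthetic: torch.Tensor, student: torch.Tensor) -> torch.Tensor:
    """Step 3: orthogonal Procrustes. The rotation R = U @ Vh (from the SVD of
    synthetic^T @ student) best maps `synthetic` onto `student`."""
    synthetic, student = synthetic.float(), student.float()
    U, _, Vh = torch.linalg.svd(synthetic.T @ student, full_matrices=False)
    return synthetic @ (U @ Vh)

def dare(delta: torch.Tensor, drop_rate: float = 0.9) -> torch.Tensor:
    """Step 4: zero the lowest-magnitude entries and rescale the survivors
    (a magnitude-based variant of DARE's drop-and-rescale)."""
    flat = delta.flatten()
    keep = max(1, int(flat.numel() * (1.0 - drop_rate)))
    mask = torch.zeros_like(flat)
    mask[torch.topk(flat.abs(), keep).indices] = 1.0
    return (flat * mask / (1.0 - drop_rate)).reshape(delta.shape)

# One layer end to end (t, shapes, and drop_rate are illustrative):
# blended    = slerp(teacher_w_a, teacher_w_b, t=0.5)
# synthetic  = svd_project(blended, *student_w.shape, r=2048)
# aligned    = procrustes_align(synthetic, student_w)
# lora_delta = dare(aligned - student_w, drop_rate=0.9)
```

Aligning before subtracting keeps the difference tensor small and structured, which makes the DARE purification and the final LoRA factorization better behaved.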

### Mixture-of-Experts (MoE) Distillation

The standout feature of this process is the full distillation of the MoE layers, which are critical for complex reasoning.

* **Expert Fingerprinting & Clustering:** To map the 160 teacher experts onto the 128 student experts, each teacher expert was "fingerprinted." **K-Means clustering** was then used to group the 160 fingerprints into 128 distinct clusters; see the sketch after this list.
* **Expert-to-Expert Distillation:** Each of the student's 128 experts was then distilled from a weighted blend of the teacher experts assigned to its cluster, ensuring that specialized knowledge (e.g., recursion, API usage, security patterns) is transferred.
* **Router Gate Distillation:** The main MoE router gate, which decides which experts handle a given token, was also distilled to preserve the teacher's routing logic.
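
A minimal sketch of the expert-mapping step. Here `teacher_expert_weights` is a hypothetical list of 160 expert weight tensors, the fingerprint is a simple stand-in (the card does not specify the exact fingerprint features), and the blend is uniform rather than the actual weighted blend:

```python
import torch
from sklearn.cluster import KMeans

def fingerprint(expert_w: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Stand-in fingerprint: a fixed-length down-sampling of the expert's weights."""
    flat = expert_w.flatten().float()
    step = max(1, flat.numel() // dim)
    return flat[::step][:dim]

# Group the 160 teacher fingerprints into 128 clusters, one per student expert.
fps = torch.stack([fingerprint(w) for w in teacher_expert_weights]).numpy()  # (160, 64)
labels = KMeans(n_clusters=128, n_init=10, random_state=0).fit_predict(fps)

for student_idx in range(128):
    members = [w.float() for w, lab in zip(teacher_expert_weights, labels) if lab == student_idx]
    if not members:  # guard against a rare empty cluster
        continue
    blended = torch.stack(members).mean(dim=0)  # uniform blend as a stand-in
    # ...`blended` then runs through the SVD / Procrustes / DARE pipeline above...
```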

## Intended Use

This model is intended for **code generation**. It should outperform the base model on tasks that require understanding complex logic, algorithms, and software architecture.

* **Primary Use:** Code generation, refactoring, code explanation (though, as an instruct-tuned model, it may not be ideal for long-form explanations), and debugging.
* **Out of Scope:** This is not a general-purpose conversational chatbot. While it can follow instructions, its knowledge is specialized for programming tasks.
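
A minimal inference sketch with `transformers` (the repo id below is a placeholder for wherever this merged model is hosted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "your-username/qwen3-coder-30b-svd-distilled"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```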