BasedBase-Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 - MLX 4-bit Quantization
A massive and gentlemanly thank you to the original author BasedBase for creating this incredible model. This is a 4-bit quantized version of the original Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2-Fp32 model, optimized for Apple Silicon with MLX.
All of my additions and modifications are detailed below. The original, highly detailed model card from BasedBase can be found further down this page.
My Contributions & Modifications
MLX Quantization
This version of the model has been quantized to 4-bit precision using the MLX framework, making it incredibly efficient to run on Apple Silicon devices.
- Framework: MLX
- Quantization: 4-bit
- Performance: Blazing fast! From my limited testing, you can expect speeds of 70-90 tokens per second on an M4 Pro Mac.
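If you'd rather run the quant outside LM Studio, here is a minimal sketch using the mlx-lm Python package (`pip install mlx-lm`). The repository id is the one shown on this page; the prompt and generation settings are just examples.

```python
# Minimal sketch: loading and running this 4-bit MLX quant with the mlx-lm package.
# The repo id below is taken from this page; adjust if the repository name differs.
from mlx_lm import load, generate

model, tokenizer = load(
    "BennyDaBall/BasedBase-Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2-MLX-4bit"
)

messages = [
    {"role": "user", "content": "Write a Python function that reverses a linked list."}
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# generate() streams the completion to stdout when verbose=True.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```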
LM Studio Configuration & A Little Hackery...
To get this model purring perfectly with tool-calling in LM Studio, a little creative problem-solving was required.
I'm not a big Qwen guy, so I reused a prompt template that I knew worked with my last Gemma 3 MLX quant and adapted it. Hey, if it works, it works! 😉
This workaround involved modifying the .jinja prompt template to ensure native tool-calling compatibility. Because of this, a few extra steps are needed for optimal performance:
- Additional Stop Strings: Custom stop strings are necessary to prevent the model from generating unwanted text.
- Reinforcing System Prompt: A specific system prompt helps guide the model's behavior.
To make your life easier, I've included an LM Studio preset (.preset.json file) in this repository. This preset includes the correct stop strings and a well-tuned sampling/generation configuration. Just load it up, and you're good to go!
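If you want to drive the model programmatically, LM Studio also exposes an OpenAI-compatible local server (http://localhost:1234/v1 by default), and the same stop strings and reinforcing system prompt can be passed per request. The sketch below assumes the official `openai` Python client; the model id, system prompt, and stop strings are placeholders, so copy the real values from the bundled preset.

```python
# Sketch: calling this model through LM Studio's OpenAI-compatible local server.
# The stop strings below are placeholders -- use the exact strings from the
# .preset.json in this repo, since they depend on the modified .jinja template.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="basedbase-qwen3-coder-30b-a3b-instruct-480b-distill-v2-mlx-4bit",  # assumed id; use the name LM Studio lists
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},  # reinforcing system prompt goes here
        {"role": "user", "content": "Refactor this function to be iterative."},
    ],
    stop=["<stop-string-from-preset>"],  # placeholder: copy the stop strings from the preset
    temperature=0.7,
)
print(response.choices[0].message.content)
```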
Original Model Card from BasedBase
(The following is the original information provided by the model's creator.)
Model Description
This model is a distilled version of Qwen/Qwen3-Coder-30B-A3B-Instruct, designed to achieve coding and reasoning capabilities approaching those of a much larger teacher model.
It is the result of applying a LoRA created via an SVD distillation pipeline and then merging those weights into the base model. The core of this process was to transfer the nuanced knowledge from a 62-layer, 160-expert teacher model into the more efficient 48-layer, 128-expert architecture of the Qwen3-Coder-30b-a3b student model.
The primary goal was to significantly enhance performance on complex coding tasks, where the specialized knowledge of Mixture-of-Experts (MoE) layers is critical.
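The card does not spell out the merge step, but if it follows the standard LoRA convention, folding the low-rank update into a base weight looks roughly like the sketch below. The rank, alpha, and tensor names are illustrative assumptions.

```python
# Minimal sketch of merging a LoRA delta into a base weight matrix, assuming the
# standard W_merged = W_base + (alpha / r) * B @ A formulation. Only r=2048 is
# stated on this card; alpha and the shapes here are illustrative.
import numpy as np

def merge_lora(w_base: np.ndarray, lora_a: np.ndarray, lora_b: np.ndarray,
               rank: int = 2048, alpha: float = 2048.0) -> np.ndarray:
    """Fold a low-rank update (B @ A) into the dense base weight."""
    # Shapes: w_base is (out, in), lora_a is (rank, in), lora_b is (out, rank).
    return w_base + (alpha / rank) * (lora_b @ lora_a)
```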
The Distillation Methodology
This model was not trained in the conventional sense. Instead, it was created using a layer-by-layer distillation process implemented in the SVD-based distillation script. This pipeline was designed to ensure maximum precision and knowledge transfer.
Core Components
- Teacher Model: Qwen/Qwen3-Coder-480B-A35B-Instruct
- Student Model: Qwen/Qwen3-Coder-30B-A3B-Instruct
- LoRA Rank: A high rank of r=2048 was used for all modules to capture a very high degree of information from the teacher.
The Distillation Pipeline
For each corresponding layer in the student and teacher, the following pipeline was executed:
- Spherical Linear Interpolation (SLERP): For layers that fall between two teacher layers, SLERP was used to create a smooth, geometrically sound interpolation of the teacher's weights. This avoids the pitfalls of simple linear averaging.
- Singular Value Decomposition (SVD) Projection: The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed into its fundamental components (U, S, V). The top 2048 most important components were selected and then reconstructed to fit the student layer's smaller dimensions. This high-rank projection ensures maximum fidelity.
- Procrustes Analysis: After projection, the newly created "synthetic" tensor was optimally rotated in high-dimensional space to perfectly align with the student's original pre-trained tensor. This minimizes the "distance" between them before calculating the difference.
- DARE (Drop and Rescale): The difference tensor (Distilled - Aligned Student) was then purified using DARE. This process drops a significant percentage of the lowest-magnitude values (noise) and rescales the remaining important differences, creating a clean signal for the final LoRA.
Mixture-of-Experts (MoE) Distillation
The standout feature of this process is the full distillation of the MoE layers, which are critical for complex reasoning.
- Expert Fingerprinting & Clustering: To map the 160 teacher experts to the 128 student experts, each teacher expert was "fingerprinted." K-Means clustering was then used to group these 160 fingerprints into 128 distinct clusters (a rough sketch of this step appears after this list).
- Expert-to-Expert Distillation: Each of the student's 128 experts was then distilled from a weighted blend of the teacher experts assigned to its cluster. This ensures the specialized knowledge (e.g., recursion, API usage, security patterns) is transferred.
- Router Gate Distillation: The main MoE router gate, which decides which expert to use for a given token, was also distilled to preserve the teacher's intelligent routing logic.
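As a rough illustration of the expert-mapping step, the sketch below fingerprints each teacher expert and clusters the fingerprints with scikit-learn's KMeans. The card does not say how the fingerprints were actually computed, so the dominant-singular-vector feature used here is only a plausible stand-in.

```python
# Sketch of the 160-to-128 expert mapping, assuming scikit-learn's KMeans.
# The fingerprint features are an assumption; the original method is not given.
import numpy as np
from sklearn.cluster import KMeans

def fingerprint(expert_weight: np.ndarray) -> np.ndarray:
    """Summarize one teacher expert by its dominant right singular vector."""
    _, _, vt = np.linalg.svd(expert_weight, full_matrices=False)
    return vt[0]

def map_teacher_experts(teacher_experts: list[np.ndarray], n_student_experts: int = 128) -> np.ndarray:
    """Cluster the 160 teacher fingerprints into 128 groups, one per student expert."""
    fingerprints = np.stack([fingerprint(w) for w in teacher_experts])
    kmeans = KMeans(n_clusters=n_student_experts, n_init=10, random_state=0)
    labels = kmeans.fit_predict(fingerprints)  # labels[i] = student expert that teacher expert i maps to
    return labels

# Each student expert e would then be distilled from a weighted blend of the
# teacher experts with labels == e (e.g. weighted by distance to the centroid).
```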
Intended Use
This model is intended for code generation. It should be better at tasks that require understanding complex logic, algorithms, and software architecture.
- Primary Use: Code generation, refactoring, explanation (although, since it's an instruct model, it may not be perfect for explaining things), and debugging.
- Out of Scope: This is not a general-purpose conversational chatbot. While it can follow instructions, its knowledge is specialized for programming tasks.