WEBGEN-4B-Preview-480B-Double-Distill

Model Description

This model was created by distilling the Qwen3-Coder-480B Mixture-of-Experts (MoE) teacher model into the compact and efficient Tesslate/WEBGEN-4B-Preview base, merging the resulting LoRA into the base weights, and then running a second distillation pass on that merged model.

The process is the same as the first pass, except that the second distillation run blends all 256 experts instead of just 64 and uses a lower DARE-TIES drop rate. This captures more of the teacher's information than a single distillation can. You should notice fewer errors and more functional code. Remember to be specific with prompting; it is a small model, after all.

The purpose of this distill is to transfer more of the teacher model's knowledge into WEBGEN-4B-Preview and improve its overall performance. The model should perform better for web design, but it is still a 4B model. It is recommended to run it in bf16, since the weights are still only about 8 GB and small models are very sensitive to quantization. For optimal results, be specific in your prompting and avoid vague, ambiguous prompts like "Create a website for a taco restaurant". Instead use prompts like: "Make a single-file landing page for 'RasterFlow' (GPU video pipeline). Style: modern tech, muted palette, Tailwind, rounded-xl, subtle gradients. Sections: navbar, hero (big headline + 2 CTAs), logos row, features (3x cards), code block (copyable), pricing (3 tiers), FAQ accordion, footer. Constraints: semantic HTML, no external JS. Return ONLY the HTML code."

The Distillation Process: In-Depth

The creation of this model was achieved through a novel SVD-based distillation pipeline, designed specifically to tackle the unique challenge of transferring knowledge from a sparse MoE architecture to a dense one. The process ensures maximum fidelity by intelligently selecting and blending expert knowledge rather than using naive averaging.

The methodology can be broken down into five key stages:

1. Non-Linear Layer Mapping

A direct linear mapping of layers (e.g., student layer 10 from teacher layer 10) is suboptimal. This pipeline uses a non-linear sigmoid mapping function to align the student and teacher layers. This ensures that the critical first and last layers of both models are perfectly aligned, while the intermediate layers are mapped along a smooth curve. This preserves the hierarchical feature development of the teacher model.

For student layers that fall between two integer teacher layers, Spherical Linear Interpolation (SLERP) is used on the teacher's weights to create a smooth, interpolated "virtual teacher layer" that accurately represents the knowledge at that specific depth.
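The exact mapping function is not published with this card, but a minimal PyTorch sketch of the idea, assuming a rescaled sigmoid for the depth curve, might look like the following. The function names and layer counts (`map_student_to_teacher`, `slerp`, 36 student layers, 62 teacher layers) are illustrative assumptions, not the pipeline's actual code:

```python
import torch

def map_student_to_teacher(student_idx, num_student_layers, num_teacher_layers, steepness=6.0):
    """Sketch of a sigmoid layer mapping: endpoints align exactly,
    intermediate layers follow a smooth non-linear curve."""
    t = student_idx / (num_student_layers - 1)               # normalized student depth in [0, 1]
    s = torch.sigmoid(torch.tensor(steepness * (t - 0.5)))
    s0 = torch.sigmoid(torch.tensor(-steepness * 0.5))
    s1 = torch.sigmoid(torch.tensor(steepness * 0.5))
    frac = (s - s0) / (s1 - s0)                              # rescale so 0 -> 0 and 1 -> 1
    return (frac * (num_teacher_layers - 1)).item()          # fractional teacher depth

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two same-shaped weight tensors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    omega = torch.acos(torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0))
    if omega.abs() < eps:                                    # nearly parallel: plain lerp
        return (1 - t) * w_a + t * w_b
    out = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return out.reshape(w_a.shape).to(w_a.dtype)

# Example: a fractional teacher depth becomes a SLERP-blended "virtual teacher layer".
depth = map_student_to_teacher(10, num_student_layers=36, num_teacher_layers=62)
lo, hi, frac = int(depth), int(depth) + 1, depth - int(depth)
# virtual_weight = slerp(teacher_layers[lo].weight, teacher_layers[hi].weight, frac)
```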

2. MoE-to-Dense MLP Synthesis

This is the most critical and computationally intensive part of the process. Brute-force averaging all 256 experts from the teacher's MoE block for every layer would be computationally prohibitive and would dilute the specialized knowledge of each expert. Instead, a highly efficient two-pass intelligent selection method was used:

Pass 1: Centroid Calculation

First, the script creates a "fingerprint" for each of the 256 teacher experts in a given layer. This fingerprint is a flattened vector representation of all of an expert's weights. To find the "center of gravity" of the layer's knowledge, the script calculates the mean of all 256 fingerprints. This is done with a memory-efficient running sum entirely on the GPU to avoid VRAM OOMs and CPU-to-GPU transfer bottlenecks. The resulting average fingerprint is called the centroid.
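A minimal sketch of Pass 1, assuming the experts are iterable PyTorch modules and keeping the running sum on the GPU as described (names like `layer_centroid` are illustrative, not from the actual script):

```python
import torch

@torch.no_grad()
def layer_centroid(experts, device="cuda"):
    """Pass 1 sketch: running mean of flattened expert fingerprints, kept on the GPU."""
    running_sum, count = None, 0
    for expert in experts:                                   # e.g. the 256 experts of one MoE layer
        fingerprint = torch.cat(
            [p.detach().to(device, torch.float32).flatten() for p in expert.parameters()]
        )
        running_sum = fingerprint if running_sum is None else running_sum + fingerprint
        count += 1
    return running_sum / count                               # the centroid fingerprint
```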

Pass 2: Intelligent Expert Selection

The script then iterates through the 256 experts a second time. For each expert, it calculates its fingerprint and measures its Euclidean distance to the centroid. The experts closest to the centroid are the most representative of the layer's collective knowledge. The script selects the top 64 most representative experts for blending.
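Pass 2 can be sketched in the same style; again, the function and variable names are hypothetical:

```python
import torch

@torch.no_grad()
def select_representative_experts(experts, centroid, top_k=64, device="cuda"):
    """Pass 2 sketch: keep the top_k experts whose fingerprints are closest to the centroid."""
    distances = []
    for idx, expert in enumerate(experts):
        fingerprint = torch.cat(
            [p.detach().to(device, torch.float32).flatten() for p in expert.parameters()]
        )
        distances.append((torch.dist(fingerprint, centroid).item(), idx))  # Euclidean distance
    distances.sort()                                         # smallest distance first
    return [idx for _, idx in distances[:top_k]]
```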

Final Blending

Finally, only the weights of these selected 64 experts are loaded. Each expert's weights are projected down to the student model's dimensions using Randomized SVD. These projected tensors are then averaged together to create the final, synthetic dense MLP layer. This synthesized layer captures the core knowledge of the teacher's MoE block without the noise from less relevant experts.
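A rough sketch of the blending step, using `torch.svd_lowrank` as the randomized SVD. How the real pipeline maps teacher dimensions onto student dimensions is not specified here, so the projection below is only illustrative:

```python
import torch

@torch.no_grad()
def project_to_student_dims(weight, out_dim, in_dim, rank=None):
    """Illustrative down-projection: low-rank (randomized) SVD of a teacher expert weight,
    truncated to the student's [out_dim, in_dim] shape. The real projection may differ."""
    q = rank or min(out_dim, in_dim)
    U, S, V = torch.svd_lowrank(weight.float(), q=q)         # randomized SVD approximation
    return U[:out_dim, :] @ torch.diag(S) @ V[:in_dim, :].T  # student-shaped tensor

@torch.no_grad()
def synthesize_dense_mlp(selected_expert_weights, out_dim, in_dim):
    """Average the projected weights of the selected experts into one dense MLP tensor."""
    projected = [project_to_student_dims(w, out_dim, in_dim) for w in selected_expert_weights]
    return torch.stack(projected).mean(dim=0)
```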

3. Delta Calculation and Purification

Once a synthetic teacher tensor (for either an attention or MLP block) is created, it is aligned with the student's corresponding original tensor using Generalized Procrustes Analysis. This rotates the teacher's tensor to best match the student's vector space without altering its internal structure.

The difference, or delta, is then calculated: delta = aligned_synthesized_teacher - original_student

This delta represents the new knowledge to be imparted. To ensure only the most impactful changes are kept, the DARE-TIES algorithm is applied to the delta, which prunes the 80% of values with the lowest magnitude and then rescales the remaining values to preserve the tensor's original norm.
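A compact sketch of this stage, assuming the orthogonal Procrustes solution for the alignment and following the magnitude-pruning-plus-rescaling description above (function names are illustrative):

```python
import torch

@torch.no_grad()
def procrustes_align(teacher_t, student_t):
    """Rotate the synthesized teacher tensor onto the student's space (orthogonal Procrustes)."""
    M = teacher_t.float().T @ student_t.float()
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    R = U @ Vh                                               # optimal rotation matrix
    return teacher_t.float() @ R

@torch.no_grad()
def dare_ties(delta, drop_rate=0.80):
    """Prune the lowest-magnitude fraction of the delta, then rescale to preserve its norm."""
    keep = int(delta.numel() * (1 - drop_rate))              # number of values to keep
    threshold = delta.abs().flatten().kthvalue(delta.numel() - keep).values
    pruned = delta * (delta.abs() > threshold)
    return pruned * (delta.norm() / (pruned.norm() + 1e-8))

# delta = procrustes_align(synthesized_teacher, student_weight) - student_weight
# purified_delta = dare_ties(delta, drop_rate=0.80)
```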

4. LoRA Matrix Extraction

The final, purified delta tensor holds the essence of the teacher's wisdom. Singular Value Decomposition (SVD) is performed on this delta tensor to decompose it into its fundamental components. The most significant components are used to create temporary low-rank lora_A and lora_B matrices.
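In code, the extraction might look roughly like this, with the singular values split evenly between the two factors (a common convention, assumed here rather than confirmed):

```python
import torch

@torch.no_grad()
def delta_to_lora(delta, rank=2560):
    """Sketch: factor the purified delta so that lora_B @ lora_A approximately reconstructs it."""
    U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)
    r = min(rank, S.numel())
    sqrt_S = torch.diag(S[:r].sqrt())
    lora_B = U[:, :r] @ sqrt_S                               # shape [d_out, r]
    lora_A = sqrt_S @ Vh[:r, :]                              # shape [r, d_in]
    return lora_A, lora_B
```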

5. Merging for a Standalone Model

This is the final step. The temporary LoRA matrices (A and B) are multiplied together (B @ A) to reconstruct the purified delta. This delta is then added directly to the weights of the original student model. The resulting tensors are the new, "distilled" weights of the final model, which are saved as a complete, standalone model.
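A sketch of the merge, assuming the standard LoRA scaling of alpha / rank (which is 1.0 with the configuration listed below):

```python
import torch

@torch.no_grad()
def merge_lora_into_student(student_weight, lora_A, lora_B, alpha=2560, rank=2560):
    """Reconstruct the purified delta from the temporary LoRA factors and add it to the student."""
    scaling = alpha / rank                                   # 1.0 with the listed config
    delta = (lora_B @ lora_A) * scaling
    return student_weight + delta.to(student_weight.dtype)   # the new "distilled" weight
```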

Intended Use & Limitations

  • Primary: Generate complete, single-file websites (landing pages, marketing pages, simple docs) with semantic HTML and Tailwind classes.
  • Secondary: Component blocks (hero, pricing, FAQ) for manual composition.
  • Limitations: It is still a 4B model, so you need to be specific with your prompting; you may also need a few rerolls to get the best result.

Distillation Procedure Details

The knowledge transfer was performed using the following configuration for the intermediate delta calculation on the first pass:

  • Teacher Model: 480B MoE model (Qwen3-Coder-480B)
  • Student Model: 4B Dense model (WEBGEN-4B-Preview)
  • Intermediate LoRA Rank: 2560
  • Intermediate LoRA Alpha: 2560
  • DARE-TIES Drop Rate: 0.80
  • Experts Blended per MLP: 64 out of 256

The knowledge transfer was performed using the following configuration for the intermediate delta calculation on the second pass:

  • Teacher Model: 480B MoE model (Qwen3-Coder-480B)
  • Student Model: 4B Dense model (WEBGEN-4B-Preview)
  • Intermediate LoRA Rank: 2560
  • Intermediate LoRA Alpha: 2560
  • DARE-TIES Drop Rate: 0.70
  • Experts Blended per MLP: 256 out of 256

Citation

@misc{tesslate_webgen_4b_preview_2025,
  title  = {WEBGEN-4B-Preview: Design-first web generation with a 4B model},
  author = {Tesslate Team},
  year   = {2025},
  url    = {https://huggingface.co/Tesslate/WEBGEN-4B-Preview}
}
