Variance Control via Weight Rescaling in LLM Pre-training
Abstract
The Layer Index Rescaling (LIR) and Target Variance Rescaling (TVR) techniques improve variance management during LLM pre-training, leading to better performance and mitigating quantization and low-precision training challenges.
The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of controlling initial variance has been well documented for neural networks in general, the literature on initialization and on managing variance growth during LLM pre-training specifically is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
Community
🚀 Controlling Weight Variance for Better LLM Performance 🚀
We trained over **40 one-billion-parameter LLaMA models for 100 billion tokens** and discovered that **controlling weight variance at initialization and during pre-training** is crucial for improving downstream task performance, leading to gains of up to **4.6% on common benchmarks**! 📈
To achieve this, we introduce:
✅ Layer Index Rescaling (LIR) – a weight initialization scheme
✅ Target Variance Rescaling (TVR) – a variance control strategy
Beyond performance gains, these techniques also help reduce extreme activation values, mitigating risks in quantization and low-precision training for LLMs.
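To make the two techniques concrete, here is a minimal sketch in numpy. The exact formulas are assumptions for illustration, not the paper's verbatim definitions: LIR is sketched as shrinking the initialization standard deviation of deeper layers by `1/sqrt(layer_index)`, and TVR as periodically rescaling a weight matrix back to a fixed target standard deviation; the `base_std`/`target_std` value of 0.02 and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def lir_init(shape, layer_index, base_std=0.02):
    """Layer Index Rescaling (sketch, assumed form): damp the init std of
    deeper layers so variance does not compound with depth."""
    std = base_std / np.sqrt(layer_index)
    return rng.normal(0.0, std, size=shape)

def tvr_rescale(weight, target_std=0.02):
    """Target Variance Rescaling (sketch, assumed form): rescale weights so
    their empirical std matches a fixed target during pre-training."""
    return weight * (target_std / (weight.std() + 1e-8))

# Toy usage: initialize a layer-4 weight matrix, let "training" inflate it,
# then pull its variance back to the target.
w = lir_init((256, 256), layer_index=4)
w_inflated = w * 3.0              # pretend training inflated the weights
w_controlled = tvr_rescale(w_inflated)
print(round(float(w_controlled.std()), 4))  # close to the 0.02 target
```

Applying TVR as a periodic pass over all weight matrices (rather than once, as above) would match the "during pre-training" framing in the post.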
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models (2025)
- HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization (2025)
- AdaGC: Improving Training Stability for Large Language Model Pretraining (2025)
- Binary Neural Networks for Large Language Model: A Survey (2025)
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture (2025)
- A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization (2025)
- Hyperspherical Normalization for Scalable Deep Reinforcement Learning (2025)