Variance Control via Weight Rescaling in LLM Pre-training
Abstract
The Layer Index Rescaling (LIR) and Target Variance Rescaling (TVR) techniques improve variance management during LLM pre-training, leading to better performance and mitigating quantization and low-precision training challenges.
The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of controlling initial variance has been well documented for neural networks in general, the literature on initialization and on managing variance growth during LLM pre-training specifically is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
Community
🚀 Controlling Weight Variance for Better LLM Performance 🚀
We trained over **40 one-billion-parameter LLaMA models for 100 billion tokens** and discovered that **controlling weight variance at initialization and during pre-training** is crucial for improving downstream task performance, leading to gains of up to **4.6% on common benchmarks**! 📈
To achieve this, we introduce:
✅ Layer Index Rescaling (LIR) – a weight initialization scheme
✅ Target Variance Rescaling (TVR) – a variance control strategy
Beyond performance gains, these techniques also help reduce extreme activation values, mitigating risks in quantization and low-precision training for LLMs.
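To make the two techniques concrete, here is a minimal sketch in numpy. The exact formulas are assumptions for illustration, not the paper's verbatim definitions: LIR is sketched as shrinking the initialization standard deviation of deeper layers by `1/sqrt(layer_index)`, and TVR as periodically rescaling a weight matrix back to a fixed target standard deviation; the `base_std`/`target_std` value of 0.02 and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def lir_init(shape, layer_index, base_std=0.02):
    """Layer Index Rescaling (sketch, assumed form): damp the init std of
    deeper layers so variance does not compound with depth."""
    std = base_std / np.sqrt(layer_index)
    return rng.normal(0.0, std, size=shape)

def tvr_rescale(weight, target_std=0.02):
    """Target Variance Rescaling (sketch, assumed form): rescale weights so
    their empirical std matches a fixed target during pre-training."""
    return weight * (target_std / (weight.std() + 1e-8))

# Toy usage: initialize a layer-4 weight matrix, let "training" inflate it,
# then pull its variance back to the target.
w = lir_init((256, 256), layer_index=4)
w_inflated = w * 3.0              # pretend training inflated the weights
w_controlled = tvr_rescale(w_inflated)
print(round(float(w_controlled.std()), 4))  # close to the 0.02 target
```

Applying TVR as a periodic pass over all weight matrices (rather than once, as above) would match the "during pre-training" framing in the post.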
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models (2025)
- HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization (2025)
- AdaGC: Improving Training Stability for Large Language Model Pretraining (2025)
- Binary Neural Networks for Large Language Model: A Survey (2025)
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture (2025)
- A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization (2025)
- Hyperspherical Normalization for Scalable Deep Reinforcement Learning (2025)