When Does Sparsity Mitigate the Curse of Depth in LLMs
Abstract
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer-effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
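The abstract ties depth under-utilization to variance growth across Pre-LN blocks. As a rough illustration of that quantity (not the authors' code; the model name and prompt below are placeholders), the following sketch records how residual-stream variance changes with depth in any Pre-LN causal LM from the Hugging Face hub:

```python
# Minimal sketch: probe per-layer hidden-state variance in a Pre-LN decoder,
# the quantity the paper links to near-identity behavior in deep blocks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any Pre-LN causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Sparsity can regulate variance propagation.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] is the residual
# stream after block i. Growing variance with depth is the symptom discussed above.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i:2d}: variance = {h.float().var().item():.3f}")
```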
Community
Revealing sparsity as an intrinsic variance regulator that unlocks effective depth utilization
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Fast Forward: Accelerating LLM Prefill with Predictive FFN Sparsity (2026)
- Post-LayerNorm Is Back: Stable, ExpressivE, and Deep (2026)
- Improving MoE Compute Efficiency by Composing Weight and Data Sparsity (2026)
- Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers (2026)
- HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing (2026)
- ActTail: Global Activation Sparsity in Large Language Models (2026)
- Deterministic Differentiable Structured Pruning for Large Language Models (2026)