mx1 committed
Commit 5206936 · verified · 1 Parent(s): 032e97b

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -47,8 +47,8 @@ Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS
 
 Guided by [Ling Scaling Laws](https://arxiv.org/abs/2507.17702), Ling 2.0 adopts a __1/32 activation-ratio MoE architecture__, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, __aux-loss-free + sigmoid routing strategy__, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable __small-activation MoE__ models to achieve __7× efficiency gains__ over equivalent dense architectures.
 In other words, with just __6.1B activated parameters (4.8B non-embedding)__, __Ling-flash-2.0__ can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages:
-On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
-With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.
+* On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
+* With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.
 
 
 <p align="center">
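
The paragraph in the diff above credits part of the efficiency to an __aux-loss-free + sigmoid routing strategy__. The sketch below is only a rough illustration of that idea, not Ling-flash-2.0's actual code: the hidden size, expert count, top-k, and bias-update rule are all assumptions. A router of this kind scores experts with a sigmoid and keeps load balanced by nudging a selection-only bias, instead of adding an auxiliary balancing loss.

```python
import torch


class SigmoidRouter(torch.nn.Module):
    """Illustrative sigmoid router with an aux-loss-free balancing bias.

    Hypothetical sizes (hidden 2048, 256 experts, top-8); not the real
    Ling-flash-2.0 configuration.
    """

    def __init__(self, hidden_size=2048, num_experts=256, top_k=8, bias_lr=1e-3):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
        # Bias used only for expert *selection*; updated outside backprop.
        self.register_buffer("balance_bias", torch.zeros(num_experts))
        self.top_k = top_k
        self.bias_lr = bias_lr

    def forward(self, x):
        # Sigmoid affinities instead of a softmax over experts.
        scores = torch.sigmoid(self.gate(x))                   # [tokens, experts]
        # Select experts with the biased scores, but weight outputs with the
        # unbiased scores so gradients are unaffected by the balancing bias.
        _, idx = torch.topk(scores + self.balance_bias, self.top_k, dim=-1)
        weights = torch.gather(scores, -1, idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize top-k

        # Aux-loss-free balancing: lower the bias of overloaded experts and
        # raise it for underloaded ones, based on how often each was chosen.
        with torch.no_grad():
            load = torch.zeros_like(self.balance_bias)
            load.scatter_add_(0, idx.reshape(-1), torch.ones(idx.numel()))
            self.balance_bias -= self.bias_lr * torch.sign(load - load.mean())
        return idx, weights


router = SigmoidRouter()
expert_ids, expert_weights = router(torch.randn(4, 2048))  # 4 token vectors
print(expert_ids.shape, expert_weights.shape)              # [4, 8] and [4, 8]
```

The point of the aux-loss-free variant is that balancing acts only on the selection bias, so the gradient path through the unbiased sigmoid weights is left untouched.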
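
The second bullet refers to __YaRN extrapolation__ for 128K context. Purely as a hedged illustration of how YaRN-style RoPE scaling is commonly switched on when loading a model with `transformers`: the repo id, the `rope_scaling` keys, the scaling factor, and the assumed native context length below are placeholders, so the model card's official instructions take precedence.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-flash-2.0"  # assumed repo id; verify on the model card

# Hypothetical YaRN override; values are placeholders, not the official recipe.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # assumed: native length x4 -> 128K
    "original_max_position_embeddings": 32768,  # assumed native context length
}

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="auto",
    trust_remote_code=True,
)
```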