Update README.md
README.md
@@ -47,8 +47,8 @@ Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS
 
 Guided by [Ling Scaling Laws](https://arxiv.org/abs/2507.17702), Ling 2.0 adopts a __1/32 activation-ratio MoE architecture__, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, __aux-loss-free + sigmoid routing strategy__, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable __small-activation MoE__ models to achieve __7× efficiency gains__ over equivalent dense architectures.
 
 In other words, with just __6.1B activated parameters (4.8B non-embedding)__, __Ling-flash-2.0__ can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages:
-
-
+* On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
+* With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.
 
 
 <p align="center">
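The __aux-loss-free + sigmoid routing strategy__ named in the changed paragraph refers to scoring experts with a sigmoid gate and keeping load balanced through a routing-only bias term instead of an auxiliary balancing loss. A minimal PyTorch sketch of that idea follows; the class name, hyperparameters, and bias-update rule are illustrative assumptions, not the actual Ling 2.0 implementation.

```python
import torch
import torch.nn as nn


class SigmoidTopKRouter(nn.Module):
    """Illustrative sigmoid top-k router with a bias-based, aux-loss-free balancing term."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int,
                 bias_update_speed: float = 1e-3):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Per-expert bias used only for expert *selection*; it is nudged up or down
        # based on observed load instead of adding an auxiliary balancing loss.
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.top_k = top_k
        self.bias_update_speed = bias_update_speed

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_size)
        scores = torch.sigmoid(self.gate(x))                      # sigmoid scoring, not softmax
        _, topk_idx = (scores + self.expert_bias).topk(self.top_k, dim=-1)
        topk_scores = scores.gather(-1, topk_idx)
        # Combine weights re-normalize the raw sigmoid scores; the bias affects routing only.
        weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        if self.training:
            with torch.no_grad():
                # Count how many tokens each expert received in this batch.
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, topk_idx.reshape(-1),
                                  torch.ones(topk_idx.numel(), device=x.device))
                # Push the bias down for overloaded experts and up for underloaded ones.
                self.expert_bias -= self.bias_update_speed * torch.sign(load - load.mean())
        return topk_idx, weights
```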
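The __YaRN extrapolation__ bullet corresponds to RoPE scaling applied at inference time. A hedged example of what enabling it could look like with Hugging Face `transformers` is sketched below; the repository id, scaling factor, and original context length are assumptions, and the settings recommended in the model card should take precedence.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; adjust to the actual Ling-flash-2.0 release.
model_name = "inclusionAI/Ling-flash-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # may not be required, depending on the release
    # Illustrative YaRN settings: the factor and original context length below
    # are assumptions; use the values recommended in the model card.
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)

prompt = "Summarize the following long document:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```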