Update README.md
README.md
@@ -47,8 +47,8 @@ Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS
 
 Guided by [Ling Scaling Laws](https://arxiv.org/abs/2507.17702), Ling 2.0 adopts a __1/32 activation-ratio MoE architecture__, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, __aux-loss-free + sigmoid routing strategy__, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable __small-activation MoE__ models to achieve __7× efficiency gains__ over equivalent dense architectures.
 
 In other words, with just __6.1B activated parameters (4.8B non-embedding)__, __Ling-flash-2.0__ can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages:
-
-
+* On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
+* With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.
 
 
 <p align="center">
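The __aux-loss-free + sigmoid routing strategy__ named in the changed paragraph refers to scoring experts with a sigmoid gate and keeping load balanced through a routing-only bias term instead of an auxiliary balancing loss. A minimal PyTorch sketch of that idea follows; the class name, hyperparameters, and bias-update rule are illustrative assumptions, not the actual Ling 2.0 implementation.

```python
import torch
import torch.nn as nn


class SigmoidTopKRouter(nn.Module):
    """Illustrative sigmoid top-k router with a bias-based, aux-loss-free balancing term."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int,
                 bias_update_speed: float = 1e-3):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Per-expert bias used only for expert *selection*; it is nudged up or down
        # based on observed load instead of adding an auxiliary balancing loss.
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.top_k = top_k
        self.bias_update_speed = bias_update_speed

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_size)
        scores = torch.sigmoid(self.gate(x))                      # sigmoid scoring, not softmax
        _, topk_idx = (scores + self.expert_bias).topk(self.top_k, dim=-1)
        topk_scores = scores.gather(-1, topk_idx)
        # Combine weights re-normalize the raw sigmoid scores; the bias affects routing only.
        weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        if self.training:
            with torch.no_grad():
                # Count how many tokens each expert received in this batch.
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, topk_idx.reshape(-1),
                                  torch.ones(topk_idx.numel(), device=x.device))
                # Push the bias down for overloaded experts and up for underloaded ones.
                self.expert_bias -= self.bias_update_speed * torch.sign(load - load.mean())
        return topk_idx, weights
```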
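The __YaRN extrapolation__ bullet corresponds to RoPE scaling applied at inference time. A hedged example of what enabling it could look like with Hugging Face `transformers` is sketched below; the repository id, scaling factor, and original context length are assumptions, and the settings recommended in the model card should take precedence.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; adjust to the actual Ling-flash-2.0 release.
model_name = "inclusionAI/Ling-flash-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # may not be required, depending on the release
    # Illustrative YaRN settings: the factor and original context length below
    # are assumptions; use the values recommended in the model card.
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)

prompt = "Summarize the following long document:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```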