Update README.md
README.md
@@ -54,36 +54,42 @@ The pruned 15 B student is distilled from a calibrated 24 B teacher using a
### Results

Up to 40 % parameter reduction (24 B → 15 B) delivers 2× lower time‑to‑first‑token (TTFT) and ≈ 40 % higher tokens/s than the uncompressed teacher, while matching its perplexity and divergence metrics. This validates SimplePrune as an effective route to deploying KafkaLM in memory‑constrained, sparsity‑accelerated environments.

| Metric | Mistral‑24B | **KafkaLM‑15B** | Δ |
|--------|-------------|-----------------|---|
| Time‑to‑First‑Token | 4.91 s | **2.46 s** | −50% |
| Prompts / s | 4.70 | **6.55** | +38% |
| Tokens / s | 579 | **812** | +40% |

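For reference, the Δ column follows directly from the two raw columns. The sketch below uses only the rounded figures printed in the table, so small deviations from the reported Δ values are expected.

```python
# Sanity-check of the results table's Δ column, using only the rounded figures above.
teacher = {"ttft_s": 4.91, "prompts_per_s": 4.70, "tokens_per_s": 579}  # Mistral-24B
student = {"ttft_s": 2.46, "prompts_per_s": 6.55, "tokens_per_s": 812}  # KafkaLM-15B

ttft_speedup = teacher["ttft_s"] / student["ttft_s"]                    # ≈ 2.0× lower TTFT
prompts_gain = student["prompts_per_s"] / teacher["prompts_per_s"] - 1  # ≈ +39 %
tokens_gain = student["tokens_per_s"] / teacher["tokens_per_s"] - 1     # ≈ +40 %

print(f"TTFT: {ttft_speedup:.2f}x lower ({student['ttft_s'] / teacher['ttft_s'] - 1:+.0%})")
print(f"Prompts/s: {prompts_gain:+.0%}   Tokens/s: {tokens_gain:+.0%}")
```
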
### Training scalability (distillation run, MI300A cluster)

| Nodes | Tokens / s | Speed‑up |
|-------|------------|----------|
| 4 | 1 461 | – |
| 8 | 3 327 | 2.3 × |
| 16 | 7 423 | 5.1 × |
| 32 | 15 286 | 10.5 × |
| 64 | 25 455 | 17.4 × |

Near‑linear scaling thanks to sharded ZeRO‑3 + RCCL optimisations.
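
The card does not include the training launch configuration. As a rough, illustrative sketch only: if the ZeRO‑3 sharding mentioned above was driven by DeepSpeed (the usual home of ZeRO‑3), it would typically be expressed in a config of this shape. Every value below is a placeholder rather than a setting from this run, and `ds_config` is a hypothetical name. On ROCm systems such as MI300A, the collectives behind such a config go through RCCL via PyTorch's `nccl` process-group backend, so RCCL needs no keys of its own here.

```python
# Illustrative DeepSpeed ZeRO-3 config (placeholder values, NOT the settings of this run).
# It would be passed as `config=ds_config` to deepspeed.initialize(); on ROCm the "nccl"
# process group that DeepSpeed uses underneath maps to RCCL.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                # shard params, gradients, and optimizer state
        "overlap_comm": True,                      # overlap all-gather/reduce-scatter with compute
        "contiguous_gradients": True,
        "reduce_bucket_size": 5.0e8,
        "stage3_prefetch_bucket_size": 5.0e8,
        "stage3_param_persistence_threshold": 1.0e6,
    },
}
```

Because stage 3 shards parameters, gradients, and optimizer states across all ranks, per-device memory stays roughly flat as nodes are added, which is what makes the near-linear tokens/s scaling above possible.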

## Citation

```bibtex
@misc{deepcoder2025,
  title={DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level},
  author={Michael Luo and Sijun Tan and Roy Huang and Ameen Patel and Alpay Ariyak and Qingyang Wu and Xiaoxiang Shi and Rachel Xin and Colin Cai and Maurice Weber and Ce Zhang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  howpublished={\url{https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51}},
  note={Notion Blog},
  year={2025}
}
```

```bibtex
@misc{kafkalm2025,
  title={Evaluating AMD's MI300A APU: Performance Insights on LLM Training via Knowledge Distillation},
  author={Dennis Dickmann and Philipp Offenhäuser and Rishabh Saxena and George S. Markomanolis and Alessandro Rigazzi and Patrick Keller and Dennis Hoppe},
  howpublished={Cray User Group Conference, 2025},
  note={to be published},
  year={2025}
}
```