alexmarques committed
Commit a52e79e · verified · 1 Parent(s): 4618de4

Update README.md

Files changed (1):
  1. README.md +48 -5

README.md CHANGED
@@ -25,7 +25,8 @@ datasets:
- **Model Developers:** Red Hat (Neural Magic)

This model is a fine-tuned version of the 2:4 sparse model [RedHatAI/Sparse-Llama-3.1-8B-2of4](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-2of4) on the trl-lib/tldr dataset.
- This sparse model obtains 0.366 BERTScore on the test set of trl-lib/tldr, the same result obtained by [nm-testing/Llama-3.1-8B-tldr](https://huggingface.co/nm-testing/Llama-3.1-8B-tldr), a dense model fine-tuned on the same dataset.
+ This sparse model recovers 100% of the BERTScore (0.366) obtained by the dense model [RedHatAI/Llama-3.1-8B-tldr](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr).
+

## Deployment

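The recovery claim above is measured with BERTScore. For reference, a minimal sketch of computing it with Hugging Face's `evaluate` wrapper, assuming the metric's default English configuration (the model card's exact evaluation setup may differ):

```python
import evaluate

# Toy prediction/reference pair; the reported 0.366 is computed over the
# trl-lib/tldr test split, not over this example.
predictions = ["TL;DR: you can now train and deploy 2:4 sparse LLMs."]
references = ["You can train sparse LLMs with llm-compressor and run them fast on vLLM."]

bertscore = evaluate.load("bertscore")
scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(scores["f1"])  # per-sample F1 scores; the model card reports the mean
```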
@@ -33,7 +34,7 @@ This model can be deployed efficiently using [vLLM](https://docs.vllm.ai/en/late

Run the following command to start the vLLM server:
```bash
- vllm serve nm-testing/Sparse-Llama-3.1-8B-tldr-2of4
+ vllm serve RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4
```

Once your server is started, you can query the model using the OpenAI API:
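The diff elides the unchanged client-setup lines; a minimal sketch of them, with the localhost endpoint and placeholder API key as assumptions for a default local deployment:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the base URL and key below are
# assumptions for a default local server, not lines from the diff.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
```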
@@ -56,7 +57,7 @@ TITLE: Training sparse LLMs

POST: Now you can use the llm-compressor integration with axolotl to train sparse LLMs!

- It's super easy to use. See the example in https://huggingface.co/nm-testing/Sparse-Llama-3.1-8B-tldr-2of4.
+ It's super easy to use. See the example in https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4.

And there's more. You can run 2:4 sparse models on vLLM and get significant speedups on Hopper GPUs!
"""
@@ -64,7 +65,7 @@ And there's more. You can run 2:4 sparse models on vLLM and get significant spee

prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

completion = client.completions.create(
-     model="nm-testing/Sparse-Llama-3.1-8B-tldr-2of4",
+     model="RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4",
    prompt=prompt,
    max_tokens=256,
)
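The lines that follow are unchanged and elided from the diff; with the standard OpenAI client, the generated summary is read as:

```python
# The generated TL;DR is the text of the first completion choice.
print(completion.choices[0].text)
```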
@@ -214,7 +215,7 @@ The model was evaluated on the test split of trl-lib/tldr using the Neural Magic
These results can be reproduced using the following command:

```bash
- lm_eval --model vllm --model_args "pretrained=nm-testing/Sparse-Llama-3.1-8B-tldr-2of4,dtype=auto,add_bos_token" --batch-size auto --tasks tldr
+ lm_eval --model vllm --model_args "pretrained=RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4,dtype=auto,add_bos_token" --batch-size auto --tasks tldr
```
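The same evaluation can also be driven from Python; a sketch, assuming the Neural Magic fork of lm-evaluation-harness (which provides the custom `tldr` task) is installed:

```python
import lm_eval

# "tldr" is a custom task provided by the Neural Magic fork of
# lm-evaluation-harness; it is not in the upstream task registry.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4,dtype=auto,add_bos_token=True",
    tasks=["tldr"],
)
print(results["results"]["tldr"])
```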

<table>
@@ -269,3 +270,45 @@ lm_eval --model vllm --model_args "pretrained=nm-testing/Sparse-Llama-3.1-8B-tld
</td>
</tr>
</table>
+
+
+ ## Inference Performance
+
+ We evaluated the inference performance of this model using the first 1,000 samples from the training set of the [trl-lib/tldr](https://huggingface.co/datasets/trl-lib/tldr) dataset.
+ Benchmarking was conducted with [vLLM](https://docs.vllm.ai/en/latest/) version `0.9.0.1` and [GuideLLM](https://github.com/neuralmagic/guidellm) version `0.2.1`.
+
+ The figure below presents the **mean end-to-end latency per request** across varying request rates.
+ Results are shown for this model, as well as three variants:
+ - **Dense:** [Llama-3.1-8B-tldr](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr)
+ - **Dense-quantized:** [Llama-3.1-8B-tldr-FP8-dynamic](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic)
+ - **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic)
+
+ Although sparsity by itself does not significantly improve performance, combining it with quantization delivers up to a 1.6x speedup.
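For intuition about what 2:4 sparsity means structurally, a toy illustration (not the pruning recipe used for this model): every group of four consecutive weights keeps at most two non-zeros, which is the layout Hopper's sparse tensor cores accelerate.

```python
import numpy as np

# Toy 2:4 pattern: in every group of four consecutive weights, zero out the
# two smallest-magnitude entries, leaving at most two non-zeros per group.
w = np.random.randn(2, 8)
groups = w.reshape(-1, 4)
smallest = np.argsort(np.abs(groups), axis=1)[:, :2]
np.put_along_axis(groups, smallest, 0.0, axis=1)
print(groups.reshape(w.shape))  # exactly two zeros in every group of four
```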
+
+
+ ![Latency](./inference_performance/latency.png)
+
+
+
+ <details><summary><strong>Reproduction instructions</strong></summary>
+
+ To replicate the benchmark:
+
+ 1. Generate a JSON file containing the first 1,000 training samples (a sanity-check sketch follows these instructions):
+ ```python
+ from datasets import load_dataset
+ ds = load_dataset("trl-lib/tldr", split="train").take(1000)
+ ds.to_json("tldr_1000.json")
+ ```
+
+ 2. Start a vLLM server with your target model, e.g.:
+ ```bash
+ vllm serve RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4
+ ```
+
+ 3. Run the benchmark with GuideLLM:
+ ```bash
+ GUIDELLM__OPENAI__MAX_OUTPUT_TOKENS=128 guidellm benchmark --target "http://localhost:8000" --rate-type sweep --data tldr_1000.json
+ ```
+ > The average output length is approximately 30 tokens per sample. We capped generation at 128 tokens to reduce performance skew from rare, unusually verbose completions.
+
+ </details>
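As a quick sanity check on the file produced in step 1, a sketch; the `prompt`/`completion` column names are the trl-lib/tldr schema, and JSON Lines is the `datasets` default for `to_json`:

```python
import json

# Dataset.to_json() writes JSON Lines by default: one sample per line.
with open("tldr_1000.json") as f:
    sample = json.loads(f.readline())

print(sample["prompt"][:80])   # the formatted Reddit post
print(sample["completion"])    # the reference TL;DR
```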