---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-0.6B
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT4
---

# Qwen3-0.6B-quantized.w4a16

## Model Overview
- **Model Architecture:** Qwen3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
- **Intended Use Cases:**
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 05/05/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) to the INT4 data type.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within transformer blocks are quantized.
Weights are quantized using a symmetric per-group scheme, with group size 128.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

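As a back-of-the-envelope check of the memory claim (illustrative arithmetic only, assuming one FP16 scale is stored per group of 128 weights):

```python
# Effective footprint of W4A16 weights with a symmetric per-group scheme:
# 4-bit weights plus one scale per group of 128 (no zero-points in the
# symmetric case). Assumption: scales are kept in FP16.
bits_weight = 4
group_size = 128
bits_scale = 16

effective_bits = bits_weight + bits_scale / group_size  # 4.125 bits/parameter
reduction = 1 - effective_bits / 16                     # vs. 16-bit weights
print(f"{effective_bits:.3f} bits/parameter, ~{reduction:.0%} smaller")  # ~74%
```
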
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-0.6B-quantized.w4a16"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

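For example, after launching a server with `vllm serve RedHatAI/Qwen3-0.6B-quantized.w4a16`, the model can be queried through the standard OpenAI Python client (a minimal sketch; the port and placeholder API key assume vLLM's defaults):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key is unused by vLLM but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen3-0.6B-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
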
## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "Qwen/Qwen3-0.6B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

model = AutoModelForCausalLM.from_pretrained(model_stub)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Render each calibration example with the model's chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    ignore=["lm_head"],
    sequential_targets=["Qwen3DecoderLayer"],
    targets="Linear",
    scheme="W4A16",
    dampening_frac=0.1,
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>

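One way to sanity-check the result (a hypothetical snippet, not part of the recipe above) is to confirm that the saved config carries the quantization metadata written by llm-compressor:

```python
from transformers import AutoConfig

# llm-compressor serializes the quantization scheme into config.json;
# the exact layout of this entry is an implementation detail (assumption).
config = AutoConfig.from_pretrained("Qwen3-0.6B-quantized.w4a16")
print(getattr(config, "quantization_config", None))
```
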
## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), multilingual MGSM, and reasoning tasks (AIME 2024/2025, GPQA diamond, MATH level 5, and LiveCodeBench), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [vLLM](https://docs.vllm.ai/en/stable/). The command below reproduces the OpenLLM v1 portion.

<details>
<summary>Evaluation details</summary>

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-0.6B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```
</details>

### Accuracy

<table>
  <tr>
    <th>Category</th>
    <th>Benchmark</th>
    <th>Qwen3-0.6B</th>
    <th>Qwen3-0.6B-quantized.w4a16<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="7"><strong>OpenLLM v1</strong></td>
    <td>MMLU (5-shot)</td>
    <td>80.96</td>
    <td>80.36</td>
    <td>99.3%</td>
  </tr>
  <tr>
    <td>ARC Challenge (25-shot)</td>
    <td>69.03</td>
    <td>68.69</td>
    <td>99.5%</td>
  </tr>
  <tr>
    <td>GSM-8K (5-shot, strict-match)</td>
    <td>87.64</td>
    <td>85.97</td>
    <td>98.1%</td>
  </tr>
  <tr>
    <td>Hellaswag (10-shot)</td>
    <td>71.10</td>
    <td>71.18</td>
    <td>100.1%</td>
  </tr>
  <tr>
    <td>Winogrande (5-shot)</td>
    <td>69.77</td>
    <td>70.09</td>
    <td>100.5%</td>
  </tr>
  <tr>
    <td>TruthfulQA (0-shot, mc2)</td>
    <td>58.63</td>
    <td>58.86</td>
    <td>100.4%</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>72.86</strong></td>
    <td><strong>72.52</strong></td>
    <td><strong>99.6%</strong></td>
  </tr>
  <tr>
    <td rowspan="7"><strong>OpenLLM v2</strong></td>
    <td>MMLU-Pro (5-shot)</td>
    <td>17.25</td>
    <td>14.27</td>
    <td>---</td>
  </tr>
  <tr>
    <td>IFEval (0-shot)</td>
    <td>62.83</td>
    <td>55.81</td>
    <td>88.8%</td>
  </tr>
  <tr>
    <td>BBH (3-shot)</td>
    <td>4.23</td>
    <td>1.63</td>
    <td>---</td>
  </tr>
  <tr>
    <td>Math-lvl-5 (4-shot)</td>
    <td>18.26</td>
    <td>10.26</td>
    <td>---</td>
  </tr>
  <tr>
    <td>GPQA (0-shot)</td>
    <td>0.00</td>
    <td>0.00</td>
    <td>---</td>
  </tr>
  <tr>
    <td>MuSR (0-shot)</td>
    <td>0.00</td>
    <td>0.00</td>
    <td>---</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>17.10</strong></td>
    <td><strong>13.66</strong></td>
    <td><strong>---</strong></td>
  </tr>
  <tr>
    <td><strong>Multilingual</strong></td>
    <td>MGSM (0-shot)</td>
    <td>19.70</td>
    <td>19.90</td>
    <td>---</td>
  </tr>
  <tr>
    <td rowspan="5"><strong>Reasoning<br>(generation)</strong></td>
    <td>AIME 2024</td>
    <td>9.69</td>
    <td>3.44</td>
    <td>---</td>
  </tr>
  <tr>
    <td>AIME 2025</td>
    <td>13.13</td>
    <td>6.98</td>
    <td>---</td>
  </tr>
  <tr>
    <td>GPQA diamond</td>
    <td>29.29</td>
    <td>27.78</td>
    <td>94.8%</td>
  </tr>
  <tr>
    <td>Math-lvl-5</td>
    <td>71.60</td>
    <td>70.60</td>
    <td>98.6%</td>
  </tr>
  <tr>
    <td>LiveCodeBench</td>
    <td>12.83</td>
    <td>8.35</td>
    <td>---</td>
  </tr>
</table>

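In the table above, Recovery is the quantized model's score expressed as a percentage of the baseline score; entries marked "---" appear to be rows where the raw scores are too low for the ratio to be informative. A quick check against the MMLU row (illustrative arithmetic only):

```python
# Recovery = 100 * quantized score / baseline score, e.g. MMLU (5-shot):
baseline, quantized = 80.96, 80.36
print(f"{100 * quantized / baseline:.1f}%")  # 99.3%, matching the table
```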