alexmarques committed (verified)
Commit d6fe5b6 · 1 Parent(s): 8a3538d

Update README.md

Files changed (1):
  1. README.md  +36 -23
README.md CHANGED
@@ -37,7 +37,7 @@ This model was obtained by quantizing the weights of [Qwen3-0.6B](https://huggin
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 
 Only the weights of the linear operators within transformer blocks are quantized.
-Weights are quantized using a symmetric per-group scheme, with group size 128.
+Weights are quantized using an asymmetric per-group scheme, with group size 128.
 The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
 
 
@@ -80,35 +80,48 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
 
 
-```python
-from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
-from transformers import AutoModelForCausalLM, AutoTokenizer
+```python
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.transformers import oneshot
+from transformers import AutoModelForCausalLM, AutoTokenizer
 
-# Load model
-model_stub = "Qwen/Qwen3-0.6B"
-model_name = model_stub.split("/")[-1]
+# Load model
+model_stub = "Qwen/Qwen3-0.6B"
+model_name = model_stub.split("/")[-1]
 
-num_samples = 1024
-max_seq_len = 8192
+num_samples = 1024
+max_seq_len = 8192
 
-model = AutoModelForCausalLM.from_pretrained(model_stub)
+model = AutoModelForCausalLM.from_pretrained(model_stub)
 
-tokenizer = AutoTokenizer.from_pretrained(model_stub)
+tokenizer = AutoTokenizer.from_pretrained(model_stub)
 
-def preprocess_fn(example):
+def preprocess_fn(example):
     return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
 
-ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
-ds = ds.map(preprocess_fn)
-
-# Configure the quantization algorithm and scheme
-recipe = GPTQModifier(
-    ignore=["lm_head"],
-    sequential_targets=["Qwen3DecoderLayer"],
-    targets="Linear",
-    scheme="W4A16",
-    dampening_frac=0.1,
+ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ds = ds.map(preprocess_fn)
+
+# Configure the quantization algorithm and scheme
+recipe = GPTQModifier(
+    ignore=["lm_head"],
+    sequential_targets=["Qwen3DecoderLayer"],
+    targets="Linear",
+    dampening_frac=0.01,
+    config_groups={
+        "group0": {
+            "targets": ["Linear"],
+            "weights": {
+                "num_bits": 4,
+                "type": "int",
+                "strategy": "group",
+                "group_size": 64,
+                "symmetric": False,
+                "actorder": "weight",
+                "observer": "mse",
+            }
+        }
+    }
 )
 
 # Apply quantization
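
For readers skimming the diff: the change replaces the symmetric W4A16 group-128 recipe with an asymmetric per-group INT4 recipe ("symmetric": False, "strategy": "group", with an MSE observer and weight-ordered activation reordering). The snippet below is a minimal, illustrative round-to-nearest sketch of what asymmetric per-group weight quantization computes (one scale and zero point per group of weights); it is not the GPTQ algorithm or the llm-compressor implementation, and the helper name, tensor shapes, and use of group size 64 are assumptions made only for illustration.

```python
# Illustrative sketch only: asymmetric per-group INT4 round-to-nearest quantization.
# GPTQ (as used by llm-compressor above) additionally corrects quantization error
# column by column; this sketch shows just the per-group scale/zero-point scheme.
import torch

def fake_quantize_asym_per_group(w: torch.Tensor, group_size: int = 64, num_bits: int = 4):
    """Fake-quantize a 2-D weight matrix with one scale/zero-point per group of `group_size` weights."""
    qmin, qmax = 0, 2**num_bits - 1                             # 0..15 for INT4
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)

    w_min = groups.amin(dim=-1, keepdim=True)                   # per-group minimum
    w_max = groups.amax(dim=-1, keepdim=True)                   # per-group maximum
    scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)     # asymmetric: range spans [min, max]
    zero_point = torch.round(-w_min / scale).clamp(qmin, qmax)  # integer offset so w_min maps near qmin

    q = torch.round(groups / scale + zero_point).clamp(qmin, qmax)
    dequant = (q - zero_point) * scale                          # what the kernel reconstructs at runtime
    return dequant.reshape(out_features, in_features), scale, zero_point

# Quick check of the round-trip error on a random weight matrix
w = torch.randn(64, 256)
w_dq, scale, zp = fake_quantize_asym_per_group(w)
print("mean abs error:", (w - w_dq).abs().mean().item())
```

Storing 4-bit integers plus a small amount of per-group scale/zero-point metadata instead of 16-bit weights is what yields the roughly 75% reduction in weight storage cited in the README text.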
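
The hunk is cut off at the "# Apply quantization" comment, so the calibration call itself is not visible here, and the snippet as shown also relies on `load_dataset` without importing it. As a hedged sketch only, assuming llm-compressor's standard `oneshot` entry point and the variables already defined in the snippet (`model`, `tokenizer`, `ds`, `recipe`, `max_seq_len`, `num_samples`), the continuation typically looks like the following; the exact lines in the committed README may differ, and the output path is a placeholder.

```python
# Hedged sketch, not taken from this commit: the hunk ends before the calibration call.
# Assumes the variables defined in the README snippet above (model, tokenizer, ds,
# recipe, max_seq_len, num_samples) and llm-compressor's oneshot() entry point.
from datasets import load_dataset  # needed by the snippet above but not shown in the hunk

# Apply quantization: run GPTQ calibration over the preprocessed dataset
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save the compressed checkpoint (directory name is a placeholder, not from the commit)
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path, save_compressed=True)
tokenizer.save_pretrained(save_path)
```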