---
license: apache-2.0
language:
- en
- de
- es
- fr
- ja
- pt
- ar
- cs
- it
- ko
- nl
- zh
base_model:
- ibm-granite/granite-3.3-8b-instruct
pipeline_tag: text-generation
---
|
# granite-3.3-8b-instruct-FP8-Dynamic Model Card |
|
|
|
|
|
|
This model is optimized for use with vLLM on NVIDIA GPUs with compute capability 8.0 or newer (Ampere-class: A100, A10, RTX 3090, etc.). On these GPUs, vLLM serves the FP8 weights through the weight-only FP8 Marlin kernel, giving an efficient W8A16 configuration.
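
As a minimal sketch, the checkpoint can be served through vLLM's Python API as shown below. The repository id is a placeholder (substitute the actual path of this model), and the prompt and sampling parameters are illustrative:

```python
from vllm import LLM, SamplingParams

# vLLM reads the quantization config from the checkpoint; on Ampere-class
# GPUs the FP8 weights are served via the weight-only Marlin kernel.
llm = LLM(model="<your-org>/granite-3.3-8b-instruct-FP8-Dynamic")  # placeholder repo id

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Briefly explain FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```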
|
|
|
The model was quantized with `llmcompressor` 0.6.0.1 using the following recipe:
|
```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize all Linear layers to FP8, keeping the lm_head in full precision.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
```
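
For context, a minimal sketch of how such a recipe is typically applied with `llmcompressor`'s `oneshot` entrypoint follows; the base model id is taken from the metadata above, while the output directory is an illustrative assumption:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# FP8_DYNAMIC is data-free: weights are quantized per-channel ahead of time
# and activations per-token at runtime, so no calibration dataset is passed.
oneshot(
    model="ibm-granite/granite-3.3-8b-instruct",
    recipe=recipe,
    output_dir="granite-3.3-8b-instruct-FP8-Dynamic",  # illustrative path
)
```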
|
|