granite-3.3-8b-instruct-FP8-Dynamic Model Card

This model is optimized for use with vLLM on NVIDIA GPUs with compute capability ≥ 8.0 (Ampere-class devices such as the A100, A10, and RTX 3090). On these GPUs it uses the weight-only FP8 Marlin kernel, providing an efficient W8A16 configuration.
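
For offline inference, a minimal sketch with vLLM might look like the following (the prompt and sampling settings are purely illustrative):

from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM selects the weight-only FP8 Marlin
# kernel described above on Ampere-class GPUs.
llm = LLM(model="sayed0am/granite-3.3-8b-instruct-FP8-Dynamic")
sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain FP8 weight-only quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)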

Quantization was performed with llmcompressor 0.6.0.1 using the following recipe:

from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
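
As a rough sketch, a recipe like this is typically applied through llmcompressor's oneshot entry point; the base checkpoint and output directory below are assumptions, not taken from this card:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

# Assumed base checkpoint; replace with the actual source model.
model_id = "ibm-granite/granite-3.3-8b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC is data-free, so no calibration dataset is needed.
oneshot(model=model, recipe=recipe)

# Save the compressed FP8 weights alongside the tokenizer.
save_dir = "granite-3.3-8b-instruct-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)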

The quantized checkpoint is stored in safetensors format and contains 8.17B parameters (tensor types: BF16 and F8_E4M3).