Question About Model Quantization and Pruning Approach

#1
by RO-KO - opened

Hi there,

I came across your impressive project and had a quick question out of curiosity. I noticed that the original model is in bfloat16 (bf16), but the quantized version appears to be in float16 (fp16). I was wondering if pruning was part of your quantization process, or if there's another reason for this change in precision format.
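For context on why the distinction matters: bf16 and fp16 are both 16-bit formats, but bf16 keeps fp32's 8-bit exponent (wide dynamic range, coarser mantissa) while fp16 uses a 5-bit exponent (narrower range, finer mantissa), so casting between them is not lossless. A minimal PyTorch sketch of the difference, independent of any particular model:

```python
import torch

# fp16's maximum finite value is ~65504, so large bf16/fp32 values
# overflow to inf when cast down to fp16; bf16 keeps them finite
# because it retains fp32's 8-bit exponent.
big = torch.tensor([1e5], dtype=torch.float32)
print(big.to(torch.bfloat16))  # finite, roughly 1e5
print(big.to(torch.float16))   # overflows to inf

# Note: a dtype cast like the above only changes the number format.
# Pruning (zeroing out weights) is a separate, independent step and
# would not by itself explain a bf16 -> fp16 change.
```

This is just an illustration of the two formats, not a claim about how the project's quantization pipeline actually works.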

Would you mind sharing a bit more detail about your quantization methodology or overall workflow? I'd really appreciate any insights you could offer.

Thanks in advance, and great work again!
