zhiyucheng committed
Commit b5600fe · 1 Parent(s): d180e62

update readme

Files changed (1)
  1. README.md +1 -4
README.md CHANGED
@@ -50,14 +50,11 @@ The model is quantized with nvidia-modelopt **v0.23.0** <br>
 * Calibration Dataset: [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) <br>
 ** Data collection method: Automated. <br>
 ** Labeling method: Unknown. <br>
-* Evaluation Dataset: [MMLU](https://github.com/hendrycks/test) <br>
-** Data collection method: Unknown. <br>
-** Labeling method: N/A. <br>


 ## Inference:
 **Engine:** Tensor(RT)-LLM <br>
-**Test Hardware:** B100 <br>
+**Test Hardware:** H100 <br>

 ## Post Training Quantization
 This model was obtained by quantizing the weights and activations of Meta-Llama-3.3-70B-Instruct to FP8 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformers blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
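
For readers who want to reproduce a comparable FP8 post-training quantization flow to the one described in the README text above, here is a minimal sketch using the nvidia-modelopt (TensorRT Model Optimizer) Python API. It is an illustration only, not the exact script behind this commit: the model ID, calibration slice size, sequence length, and export directory are assumptions, and API details may differ across modelopt releases (the card pins v0.23.0).

```python
# Hedged sketch: FP8 post-training quantization of a Llama checkpoint with nvidia-modelopt.
# The model ID, calibration size, and export path below are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # base model named in the card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Calibration data: a small slice of cnn_dailymail, the dataset listed in the card.
calib_texts = load_dataset(
    "abisee/cnn_dailymail", "3.0.0", split="train[:128]"
)["article"]

def forward_loop(m):
    # Run calibration samples through the model so modelopt can record activation ranges.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        m(**{k: v.to(m.device) for k, v in inputs.items()})

# Quantize the weights and activations of the linear layers to FP8.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a TensorRT-LLM checkpoint that can be compiled into an engine.
export_tensorrt_llm_checkpoint(model, decoder_type="llama", export_dir="llama-3.3-70b-fp8")
```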
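
The Inference section above names Tensor(RT)-LLM as the serving engine and H100 as the test hardware. A hedged sketch of loading an FP8 checkpoint through TensorRT-LLM's Python LLM API follows; the repository ID and sampling settings are placeholders rather than values taken from this card, and the exact entry point can vary between TensorRT-LLM releases.

```python
# Hedged sketch: serving an FP8-quantized Llama checkpoint with the TensorRT-LLM LLM API.
# The repo ID and sampling parameters are placeholders, not values from the card.
from tensorrt_llm import LLM, SamplingParams

# Hypothetical Hugging Face repo ID for the quantized model.
llm = LLM(model="nvidia/Llama-3.3-70B-Instruct-FP8")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```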