wanderhzz committed
Commit c96334d · verified · 1 Parent(s): f41678f

Add sparse attention weights & modify relevant description of model card

Files changed (1): README.md +20 -0
README.md CHANGED
@@ -106,6 +106,26 @@ python -m fastdeploy.entrypoints.openai.api_server \
  --max-num-seqs 32
  ```

+ To deploy the sparse attention version with FastDeploy and speed up long-context inference, you can run the following command.
+ For more details about sparse attention, please refer to the [PLAS Attention](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/features/plas_attention.md) documentation.
+
+ ```bash
+ export FD_ATTENTION_BACKEND="PLAS_ATTN"
+
+ python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --quantization wint4 \
+ --tensor-parallel-size 4 \
+ --engine-worker-queue-port 8182 \
+ --max-model-len 131072 \
+ --max-num-seqs 32 \
+ --max-num-batched-tokens 8192 \
+ --enable-chunked-prefill \
+ --plas-attention-config '{"plas_encoder_top_k_left": 50, "plas_encoder_top_k_right": 60, "plas_decoder_top_k_left": 100, "plas_decoder_top_k_right": 120}'
+ ```
+
  To deploy the W4A8C8 quantized version using FastDeploy, you can run the following command.

  ```bash
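
Note on the added configuration: `--plas-attention-config` takes a JSON object whose `*_left`/`*_right` keys appear to set per-phase ranges for top-k block selection (encoder 50-60, decoder 100-120 in the command above); the linked PLAS Attention document describes their exact semantics and tuning.

As a quick sanity check that the sparse-attention server is up, a request along the following lines should work. This is a minimal sketch, not part of the diff: it assumes the server from the command above is running on localhost with `--port 8180`, and that the OpenAI-compatible `/v1/chat/completions` route exposed by `fastdeploy.entrypoints.openai.api_server` accepts the model name shown; adjust the `model` field if the server registers a different name.

```bash
# Minimal sanity check (assumes the server above is up on localhost:8180;
# the "model" value may need to match the name the server actually serves).
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```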