Add sparse attention weights and update the relevant model card description
README.md CHANGED
@@ -106,6 +106,26 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --max-num-seqs 32
 ```
 
+To deploy the sparse attention version to speed up long-context inference using FastDeploy, you can run the following command.
+For more details about sparse attention, please refer to the [PLAS Attention](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/features/plas_attention.md) documentation.
+
+```bash
+export FD_ATTENTION_BACKEND="PLAS_ATTN"
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --quantization wint4 \
+    --tensor-parallel-size 4 \
+    --engine-worker-queue-port 8182 \
+    --max-model-len 131072 \
+    --max-num-seqs 32 \
+    --max-num-batched-tokens 8192 \
+    --enable-chunked-prefill \
+    --plas-attention-config '{"plas_encoder_top_k_left": 50, "plas_encoder_top_k_right": 60, "plas_decoder_top_k_left": 100, "plas_decoder_top_k_right": 120}'
+```
+
 To deploy the W4A8C8 quantized version using FastDeploy, you can run the following command.
 
 ```bash
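
Once the server added above is running, you can sanity-check the deployment with an OpenAI-compatible request. This is a minimal sketch, assuming the server is reachable on localhost at the `--port` value from the command (8180) and serves the standard `/v1/chat/completions` route; the prompt and `max_tokens` value are illustrative only.

```bash
# Smoke test against the OpenAI-compatible endpoint started above.
# Assumes localhost:8180; adjust the host/port to match your deployment.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
    "messages": [{"role": "user", "content": "Summarize the benefits of sparse attention in two sentences."}],
    "max_tokens": 128
  }'
```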