What is the recommended way to start the vLLM server engine for InternVL3_5-8B inference? I am only getting ~2 QPS
#3 opened by Rupasai
I am using the vLLM library for inference. For some reason I am only getting around 2 QPS. Is this expected? I am starting the server as follows:
vllm serve "model_path" --trust-remote-code --max-num-seqs 1000 --max-model-len 8192 --gpu-memory-utilization 0.95 --limit-mm-per-prompt '{"image": 1}' --tensor-parallel-size 1 --port 8080
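
Each request is a prompt plus a single image. A minimal sketch of how such a request would look against vLLM's OpenAI-compatible chat endpoint (the prompt and image URL below are placeholders; the model name has to match the served model path):

```python
from openai import OpenAI

# Points at the vLLM server started above (OpenAI-compatible API on port 8080).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="model_path",  # must match the model name served by `vllm serve`
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},  # placeholder prompt
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder image
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```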
Input: prompt + single image
Output: ~200 tokens
vLLM version: 0.10.1.1
GPU: H100
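
For reference, one way to measure QPS with many requests in flight at once (a sketch, assuming the server above on localhost:8080; the concurrency, request count, and image URL are placeholder values):

```python
import asyncio
import time

from openai import AsyncOpenAI

BASE_URL = "http://localhost:8080/v1"
MODEL = "model_path"                          # must match the served model name
IMAGE_URL = "https://example.com/sample.jpg"  # placeholder image
CONCURRENCY = 64                              # requests kept in flight at once
TOTAL = 256                                   # total requests to send

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")
sem = asyncio.Semaphore(CONCURRENCY)

async def one_request() -> None:
    # Send one prompt + single-image request, capped at ~200 output tokens.
    async with sem:
        await client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                ],
            }],
            max_tokens=200,
        )

async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(TOTAL)))
    elapsed = time.perf_counter() - start
    print(f"{TOTAL} requests in {elapsed:.1f}s -> {TOTAL / elapsed:.2f} QPS")

if __name__ == "__main__":
    asyncio.run(main())
```

If requests are sent one at a time instead, QPS is bounded by per-request latency, so a low number could also be a client-side effect rather than a server limit.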