What is the recommended way to start the vLLM server engine for InternVL3_5-8B inference? I am only getting ~2 QPS
#3 opened by Rupasai
I am using the vLLM library for inference. For some reason I am only getting around 2 QPS. Is this expected? I am starting the server as follows:
vllm serve "model_path" --trust-remote-code --max-num-seqs 1000 --max-model-len 8192 --gpu-memory-utilization 0.95 --limit-mm-per-prompt '{"image": 1}' --tensor-parallel-size 1 --port 8080
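
Each request is a prompt plus a single image. A minimal sketch of how such a request would look against vLLM's OpenAI-compatible chat endpoint (the prompt and image URL below are placeholders; the model name has to match the served model path):

```python
from openai import OpenAI

# Points at the vLLM server started above (OpenAI-compatible API on port 8080).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="model_path",  # must match the model name served by `vllm serve`
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},  # placeholder prompt
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder image
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```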
Input: prompt + single image
Output: ~200 tokens
vLLM version: 0.10.1.1
GPU: H100
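
For reference, one way to measure QPS with many requests in flight at once (a sketch, assuming the server above on localhost:8080; the concurrency, request count, and image URL are placeholder values):

```python
import asyncio
import time

from openai import AsyncOpenAI

BASE_URL = "http://localhost:8080/v1"
MODEL = "model_path"                          # must match the served model name
IMAGE_URL = "https://example.com/sample.jpg"  # placeholder image
CONCURRENCY = 64                              # requests kept in flight at once
TOTAL = 256                                   # total requests to send

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")
sem = asyncio.Semaphore(CONCURRENCY)

async def one_request() -> None:
    # Send one prompt + single-image request, capped at ~200 output tokens.
    async with sem:
        await client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                ],
            }],
            max_tokens=200,
        )

async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(TOTAL)))
    elapsed = time.perf_counter() - start
    print(f"{TOTAL} requests in {elapsed:.1f}s -> {TOTAL / elapsed:.2f} QPS")

if __name__ == "__main__":
    asyncio.run(main())
```

If requests are sent one at a time instead, QPS is bounded by per-request latency, so a low number could also be a client-side effect rather than a server limit.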