Model does not generate tokens when served with 4 RTX 6000 ADA GPUs on vLLM
I am having problems serving the model with vLLM v0.10.1.1.
Command used:
vllm serve $PATH \
    --served-model-name gpt-oss-120b-FP8-Dynamic \
    --host $IP \
    --port $PORT \
    --gpu-memory-utilization 0.95 \
    --max-model-len 16384 \
    --max-num-seqs 4 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --async-scheduling
The server keeps running, but no tokens are ever returned to the client: generation throughput hovers near zero and GPU KV cache usage stays at 0.0%, with the metrics stuck as follows:
(APIServer pid=3232437) INFO 08-27 15:07:46 [loggers.py:123] Engine 000: Avg prompt throughput: 8.3 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3232437) INFO 08-27 15:07:56 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3232437) INFO 08-27 15:08:06 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3232437) INFO 08-27 15:08:16 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3232437) INFO 08-27 15:08:26 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
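For reference, requests were sent through vLLM's OpenAI-compatible endpoint. A minimal sketch of the kind of request that hangs (host, port, and prompt are placeholders, not the exact call I used; the model name matches --served-model-name above):

# Minimal reproduction sketch against the OpenAI-compatible API exposed by vLLM.
# Replace <IP>/<PORT> with the values passed to --host/--port.
from openai import OpenAI

client = OpenAI(base_url="http://<IP>:<PORT>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="gpt-oss-120b-FP8-Dynamic",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)

The request is accepted (it shows up as "Running: 1 reqs" in the logs above) but never completes and no output is streamed back.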