gpt-oss 120b AMD GPU output problem #136
opened by gfatigati
I'm running gpt-oss 120b on AMD MI250 GPUs, launched with the vLLM ROCm image.
The model initially works well, but after some queries the response degenerates into something like "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!". I suspect the context length is filling up.
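For reference, I start the server roughly like this (the image tag, device flags, and parallelism below are from memory, so treat them as approximate rather than the exact command):

```bash
# Approximate launch; rocm/vllm:latest and --tensor-parallel-size 4 are assumptions.
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  rocm/vllm:latest \
  vllm serve openai/gpt-oss-120b \
    --tensor-parallel-size 4
```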
When I add --max-model-len 8192, the model returns this error:
Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 21026 tokens (21026 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
How can I solve this? It seems the context length overflows, but setting a maximum size just produces the second error instead.
Thanks.