gpt-oss 120b AMD GPU output problem

#136 opened by gfatigati

I'm using gpt-oss 120b on AMD MI250 GPUs, launching it with the vLLM ROCm image.

The model initially works well, but after some queries the response turns into something like "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!". I suspect the context length has filled up.

Trying --max-model-len 8192, the model gives this error:

Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 21026 tokens (21026 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

How can I solve this problem? The context length seems to overflow, but setting a maximum length just produces the second error above.
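
As a possible workaround I was thinking of trimming the chat history on the client side before each request, so the prompt stays under the server's limit. Below is a rough sketch of what I mean (the server URL, model name, 8192 limit, and completion budget just reflect my setup, and I'm assuming the tokenizer's chat template approximates how the server renders the prompt). Is this the right approach, or is there a proper server-side fix?

```python
# Rough sketch (assumptions: vLLM serving the OpenAI-compatible API at
# localhost:8000, model "openai/gpt-oss-120b", server started with
# --max-model-len 8192).
from openai import OpenAI
from transformers import AutoTokenizer

MAX_MODEL_LEN = 8192        # must match the server's --max-model-len
COMPLETION_BUDGET = 1024    # tokens reserved for the model's reply

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

def prompt_tokens(messages):
    # apply_chat_template approximates how the server renders the chat,
    # so this count should be close to what vLLM actually sees.
    return len(tokenizer.apply_chat_template(messages, add_generation_prompt=True))

def trim_history(messages):
    # Keep the system message and the latest turns; drop the oldest turns
    # until prompt + completion budget fits inside the context window.
    while prompt_tokens(messages) + COMPLETION_BUDGET > MAX_MODEL_LEN and len(messages) > 2:
        del messages[1]
    return messages

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=trim_history(messages),
    max_tokens=COMPLETION_BUDGET,
)
print(response.choices[0].message.content)
```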

Thanks.
