Reply with 333333333....

#2 opened by pipilok

Problem with your model on the latest llama.cpp (b6144):

```
F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 --flash-attn -ngl 999 --n-gpu-layers 999 --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4
```

(attached screenshot: изображение.png)

This is most likely due to flash attention being enabled. When I tested with flash attention, it would sometimes get caught in a coding loop or just degrade the quality of the generated code. I also tested flash attention on the base Qwen3 Coder 30B model, and it noticeably degraded coding performance.
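If flash attention is the culprit, one thing to try is relaunching the server with `--flash-attn` removed. A caveat, based on my understanding of llama.cpp rather than a test against b6144: a quantized V cache requires flash attention, so `--cache-type-v q8_0` would likely need to be dropped (falling back to the default f16) as well. A sketch of the adjusted command, with the duplicate `--n-gpu-layers 999` (same flag as `-ngl`) omitted:

```
rem Sketch: same launch as above, but without flash attention.
rem Assumption: a quantized V cache needs flash attention in llama.cpp, so --cache-type-v q8_0 is dropped too.
F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 -ngl 999 --cache-type-k q8_0 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4
```

If output quality is still off, dropping `--cache-type-k q8_0` as well can help isolate whether the KV cache quantization or flash attention itself is responsible.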

what backend do you recommend instead?

> what backend do you recommend instead?

LM Studio is what I use.
