Reply with 333333333....
#2 by pipilok - opened
I'm hitting a problem with your model on the latest llama.cpp (b6144): it just keeps replying with "333333333....". This is the command I'm running:
F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 --flash-attn -ngl 999 --n-gpu-layers 999 --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4
This is most likely due to flash attention being enabled. When I tested with flash attention, the model would sometimes get caught in a coding loop, or the quality of the generated code would degrade. I also tested flash attention on the base Qwen3 Coder 30B model and it noticeably degraded coding performance.
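If flash attention is the cause, you could try relaunching without it. A sketch of the same invocation with --flash-attn removed (I also dropped the q8_0 cache-type flags here, since the quantized V-cache in llama.cpp generally requires flash attention; the model path and port are just the ones from your command above):

F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 -ngl 999 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4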
What backend do you recommend instead?
LM Studio is what I use.