Reply with 333333333....
#2 by pipilok - opened
I'm hitting a problem with your model on the latest llama.cpp (b6144): it just keeps replying with "333333333....". This is the command I'm running:
F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 --flash-attn -ngl 999 --n-gpu-layers 999 --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4
This is most likely due to flash attention being enabled. When I tested with flash attention, the model would sometimes get caught in a coding loop, or the quality of the generated code would degrade. I also tested flash attention on the base Qwen3 Coder 30B model and it noticeably degraded coding performance.
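If flash attention is the cause, you could try relaunching without it. A sketch of the same invocation with --flash-attn removed (I also dropped the q8_0 cache-type flags here, since the quantized V-cache in llama.cpp generally requires flash attention; the model path and port are just the ones from your command above):

F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 -ngl 999 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4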
What backend do you recommend instead?
LM Studio is what I use.