Running DeepSeek R1 0528 q4_K_M and MLX 4-bit on an M3 Ultra Mac Studio

Mac model: M3 Ultra Mac Studio, 512GB, 80-core GPU

First: this model has a shockingly small KV cache. If any of you saw my post about running DeepSeek V3 q4_K_M, you'd have seen that the KV cache buffer in llama.cpp/koboldcpp was 157GB for 32k of context. I expected to see something similar here.

Not even close.

64k context on this model is barely 8GB. Below is the KV buffer from loading this model directly in llama.cpp with no special options; just specifying 65536 context, a port, and a host. That's it. No MLA, no quantized cache.

EDIT: Llama.cpp runs MLA by default.
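
For reference, here is roughly what that launch looks like when driven from Python; a minimal sketch, assuming a standard llama.cpp build where the binary is called llama-server and using a placeholder GGUF path. The only options passed are the three mentioned above: context size, host, and port.

import subprocess

# Minimal sketch of the llama.cpp server launch described above.
# The binary name and GGUF path are assumptions; adjust for your build and files.
subprocess.run([
    "./llama-server",
    "-m", "DeepSeek-R1-0528-Q4_K_M.gguf",  # placeholder path to the q4_K_M GGUF
    "-c", "65536",                          # 65536 tokens of context
    "--host", "0.0.0.0",
    "--port", "8080",
])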

65536 context:

llama_kv_cache_unified:      Metal KV buffer size =  8296.00 MiB
llama_kv_cache_unified: KV self size  = 8296.00 MiB, K (f16): 4392.00 MiB, V (f16): 3904.00 MiB

131072 context:

llama_kv_cache_unified:      Metal KV buffer size = 16592.00 MiB
llama_kv_cache_unified: KV self size  = 16592.00 MiB, K (f16): 8784.00 MiB, V (f16): 7808.00 MiB
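
That tiny footprint is MLA doing its job: rather than caching full per-head keys and values, it only needs a small compressed latent plus a rope key slice per token per layer. A quick back-of-the-envelope check (my own arithmetic, not llama.cpp output) using DeepSeek-R1's published config values (61 layers, kv_lora_rank 512, rope head dim 64) reproduces the logged buffer sizes exactly:

# Back-of-the-envelope MLA KV cache size, assuming DeepSeek-R1's config:
# 61 layers, kv_lora_rank = 512, qk_rope_head_dim = 64, f16 cache (2 bytes/element).
n_layers, kv_lora_rank, rope_dim, bytes_f16 = 61, 512, 64, 2

def kv_mib(n_ctx):
    k = n_ctx * n_layers * (kv_lora_rank + rope_dim) * bytes_f16  # K: latent + rope slice
    v = n_ctx * n_layers * kv_lora_rank * bytes_f16               # V: latent only
    return k / 2**20, v / 2**20

print(kv_mib(65536))   # (4392.0, 3904.0) MiB, matching the 65536-context log
print(kv_mib(131072))  # (8784.0, 7808.0) MiB, matching the 131072-context log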

Speed-wise, it's a fair bit on the slow side, but if this model is as good as they say it is, I really don't mind.

Example: ~11,000 token prompt:

llama.cpp server (no flash attention) (~9 minutes)

prompt eval time =  144330.20 ms / 11090 tokens (13.01 ms per token, 76.84 tokens per second)
eval time = 390034.81 ms / 1662 tokens (234.68 ms per token, 4.26 tokens per second)
total time =  534365.01 ms / 12752 tokens

MLX 4-bit for the same prompt (~2.5x the speed; 245 sec, or ~4 minutes):

2025-05-30 23:06:16,815 - DEBUG - Prompt: 189.462 tokens-per-sec
2025-05-30 23:06:16,815 - DEBUG - Generation: 11.154 tokens-per-sec
2025-05-30 23:06:16,815 - DEBUG - Peak memory: 422.248 GB
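
For anyone wanting to reproduce the MLX numbers, here is a minimal sketch using mlx-lm; the repo name is an assumption (point it at whichever 4-bit MLX conversion of R1 0528 you actually have), and verbose=True is what prints prompt/generation tokens-per-sec and peak-memory stats like the lines above.

from mlx_lm import load, generate

# Minimal mlx-lm sketch. The repo name is a placeholder for whatever
# 4-bit MLX conversion of R1 0528 you have locally or on the Hub.
model, tokenizer = load("mlx-community/DeepSeek-R1-0528-4bit")

prompt = "Explain what MLA changes about the KV cache."  # stand-in for the real ~11k-token prompt
print(generate(model, tokenizer, prompt=prompt, max_tokens=2048, verbose=True))

For real chats you'd still want the chat template applied (mlx-lm's chat and server tooling handles that), but a raw call like this is enough to see the speed and memory stats.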

Note: I tried flash attention in llama.cpp, and that went horribly. Prompt processing slowed to an absolute crawl; it would have taken longer to process the prompt than the run without -fa took for the whole prompt + response.

Another important note: when they say not to use system prompts, they mean it. I struggled with this model at first, until I finally stripped the system prompt out completely and jammed all my instructions into the user prompt instead. The model became far more intelligent after that. Specifically, if I passed in a system prompt, it would NEVER output the initial <think> tag no matter what I said or did. But if I don't use a system prompt, it always outputs the initial <think> tag appropriately.
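
In practice that just means sending a single user message and folding what would normally be system instructions into it. Here is a minimal sketch against an OpenAI-compatible endpoint (llama.cpp's llama-server exposes one); the address, instruction text, and question are all placeholders:

import requests

instructions = "Answer concisely and think step by step."  # would normally be the system prompt
question = "What does MLA change about the KV cache?"

# No "system" role at all; the instructions ride inside the user message instead.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed llama-server host/port
    json={
        "messages": [
            {"role": "user", "content": instructions + "\n\n" + question},
        ],
        "max_tokens": 2048,
    },
)
print(resp.json()["choices"][0]["message"]["content"])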

I haven't had a chance to deep-dive into this thing yet to see whether running a 4-bit version really harms the output quality, but I at least wanted to give a sneak peek at what it looks like to run it.