M3 Ultra Mac Studio 512GB prompt and write speeds for Deepseek V3 0324 671b gguf q4_K_M, for those curious

UPDATE 2025-04-13:

llama.cpp has had an update that GREATLY improved the prompt processing speed. Please see the new speeds below.

Deepseek V3 0324 Q4_K_M w/Flash Attention

4800-token context, 552-token response

CtxLimit:4744/8192,
Amt:552/4000, Init:0.07s,
Process:65.46s (64.02T/s),
Generate:50.69s (10.89T/s),
Total:116.15s

12700-token context, 342-token response

CtxLimit:12726/16384,
Amt:342/4000, Init:0.07s,
Process:210.53s (58.82T/s),
Generate:51.30s (6.67T/s),
Total:261.83s

Honestly, very usable for me. Very much so.

The KV cache sizes:

  • 32k: 157380.00 MiB
  • 16k: 79300.00 MiB
  • 8k: 40260.00 MiB
  • 8k, quantized KV (quantkv 1): 21388.12 MiB (broke the model; responses were incoherent)
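The measured cache sizes scale roughly linearly with context length (16k is about 2x the 8k figure, 32k about 4x), so you can estimate other context sizes from the 8k number. A quick sketch of that estimate; strict linearity is my assumption, and the measured values come in a percent or two under it:

```python
# Estimate the KV cache size for this model at a given context length
# by linear extrapolation from the measured 8k figure above.
# Assumption: cache grows linearly with context; measured 16k/32k
# values run ~1.5-2.5% below this estimate.

MEASURED_8K_MIB = 40260.00  # reported size at 8k context

def kv_cache_mib(ctx_tokens: int) -> float:
    """Rough KV cache estimate in MiB, linear in context length."""
    return MEASURED_8K_MIB * ctx_tokens / 8192

print(f"16k estimate: {kv_cache_mib(16384):.0f} MiB (measured: 79300)")
print(f"32k estimate: {kv_cache_mib(32768):.0f} MiB (measured: 157380)")
```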

The model load size:

load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: Metal model buffer size = 387629.18 MiB
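Adding the weight buffers to the KV cache gives a rough lower bound on memory use at each context size (this ignores compute/scratch buffers, so real usage will be somewhat higher). Note the simple sum lands above 512 GiB for the 32k case; actual behavior depends on how macOS manages the unified memory:

```python
# Rough memory budget: model weight buffers plus KV cache, using the
# figures reported above. Scratch/compute buffers are not included,
# so treat these as lower-bound estimates.

MODEL_MIB = 497.11 + 387629.18  # CPU + Metal weight buffers
KV_MIB = {"8k": 40260.0, "16k": 79300.0, "32k": 157380.0}

for ctx, kv in KV_MIB.items():
    total_gib = (MODEL_MIB + kv) / 1024
    print(f"{ctx}: ~{total_gib:.0f} GiB of 512 GiB")
```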

ORIGINAL:

For anyone curious, here are the GGUF numbers for Deepseek V3 q4_K_M (the older V3, not the newest one from this week). I loaded it up last night and tested some prompts:

M3 Ultra Mac Studio 512GB Deepseek V3 671b q4_K_M gguf without Flash Attention

CtxLimit:8102/16384, 
Amt:902/4000, Init:0.04s, 
Process:792.65s (9.05T/s), 
Generate:146.21s (6.17T/s), 
Total:938.86s

Note on the above: normally I run in debug mode to get the ms-per-token figures, but forgot to enable it this time. It comes out to about 110 ms per token for prompt processing and about 162 ms per token for generation.
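Those per-token figures can be recovered directly from the log line: prompt tokens processed = CtxLimit minus Amt, and generated tokens = Amt. A small helper (the field interpretation follows the log format shown above; that parsing is my assumption):

```python
# Recover ms/token and tokens/s from a timing line like the ones above.
# Prompt tokens processed = CtxLimit - Amt; generated tokens = Amt.

def per_token(ctx_used: int, generated: int,
              process_s: float, generate_s: float) -> dict:
    prompt_tokens = ctx_used - generated
    return {
        "prompt_ms_per_tok": 1000 * process_s / prompt_tokens,
        "prompt_tok_per_s": prompt_tokens / process_s,
        "gen_ms_per_tok": 1000 * generate_s / generated,
        "gen_tok_per_s": generated / generate_s,
    }

# The no-flash-attention run above: CtxLimit 8102, Amt 902
stats = per_token(8102, 902, 792.65, 146.21)
print(stats)  # ~110 ms/T prompt processing, ~162 ms/T generation
```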

M3 Ultra Mac Studio 512GB Deepseek V3 671b q4_K_M gguf with Flash Attention On

CtxLimit:7847/16384, 
Amt:647/4000, Init:0.04s, 
Process:793.14s (110.2ms/T = 9.08T/s), 
Generate:103.81s (160.5ms/T = 6.23T/s), 
Total:896.95s (0.72T/s)

For comparison, here is Llama 3.3 70b q8 with Flash Attention On

CtxLimit:6293/16384, 
Amt:222/800, Init:0.07s, 
Process:41.22s (8.2ms/T = 121.79T/s), 
Generate:35.71s (160.8ms/T = 6.22T/s), 
Total:76.92s (2.89T/s)
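Putting the runs side by side: the updated build's prompt processing (~64 T/s) is roughly 7x the original build's (~9 T/s), and while Llama 3.3 70b q8 still processes prompts about 13x faster than the original Deepseek numbers, it is now under 2x the updated ones. A quick check of those ratios from the figures above:

```python
# Prompt-processing speed ratios across the runs reported above (T/s).
speeds = {
    "Deepseek V3 q4_K_M (original build)": 9.05,
    "Deepseek V3 0324 q4_K_M (updated build)": 64.02,
    "Llama 3.3 70b q8": 121.79,
}
base = speeds["Deepseek V3 q4_K_M (original build)"]
for name, tps in speeds.items():
    print(f"{name}: {tps:.2f} T/s ({tps / base:.1f}x vs original build)")
```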