M3 Ultra Mac Studio 512GB prompt and write speeds for Deepseek V3 0324 671b gguf q4_K_M, for those curious

UPDATE 2025-04-13:

llama.cpp has had an update that GREATLY improved the prompt processing speed. Please see the new speeds below.

Deepseek V3 0324 Q4_K_M w/Flash Attention

4800-token context, 552-token response

CtxLimit:4744/8192,
Amt:552/4000, Init:0.07s,
Process:65.46s (64.02T/s),
Generate:50.69s (10.89T/s),
Total:116.15s

12700-token context, 342-token response

CtxLimit:12726/16384,
Amt:342/4000, Init:0.07s,
Process:210.53s (58.82T/s),
Generate:51.30s (6.67T/s),
Total:261.83s

Honestly, very usable for me. Very much so.

The KV cache sizes:

  • 32k: 157380.00 MiB
  • 16k: 79300.00 MiB
  • 8k: 40260.00 MiB
  • 8k, quantized KV (quantkv 1): 21388.12 MiB (broke the model; responses were incoherent)
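The measured cache sizes scale roughly linearly with context length (16k is about 2x the 8k figure, 32k about 4x), so you can estimate other context sizes from the 8k number. A quick sketch of that estimate; strict linearity is my assumption, and the measured values come in a percent or two under it:

```python
# Estimate the KV cache size for this model at a given context length
# by linear extrapolation from the measured 8k figure above.
# Assumption: cache grows linearly with context; measured 16k/32k
# values run ~1.5-2.5% below this estimate.

MEASURED_8K_MIB = 40260.00  # reported size at 8k context

def kv_cache_mib(ctx_tokens: int) -> float:
    """Rough KV cache estimate in MiB, linear in context length."""
    return MEASURED_8K_MIB * ctx_tokens / 8192

print(f"16k estimate: {kv_cache_mib(16384):.0f} MiB (measured: 79300)")
print(f"32k estimate: {kv_cache_mib(32768):.0f} MiB (measured: 157380)")
```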

The model load size:

load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: Metal model buffer size = 387629.18 MiB
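Adding the weight buffers to the KV cache gives a rough lower bound on memory use at each context size (this ignores compute/scratch buffers, so real usage will be somewhat higher). Note the simple sum lands above 512 GiB for the 32k case; actual behavior depends on how macOS manages the unified memory:

```python
# Rough memory budget: model weight buffers plus KV cache, using the
# figures reported above. Scratch/compute buffers are not included,
# so treat these as lower-bound estimates.

MODEL_MIB = 497.11 + 387629.18  # CPU + Metal weight buffers
KV_MIB = {"8k": 40260.0, "16k": 79300.0, "32k": 157380.0}

for ctx, kv in KV_MIB.items():
    total_gib = (MODEL_MIB + kv) / 1024
    print(f"{ctx}: ~{total_gib:.0f} GiB of 512 GiB")
```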

ORIGINAL:

For anyone curious, here are the GGUF numbers for Deepseek V3 q4_K_M (the older V3, not the newest one from this week). I loaded it up last night and tested some prompts:

M3 Ultra Mac Studio 512GB Deepseek V3 671b q4_K_M gguf without Flash Attention

CtxLimit:8102/16384, 
Amt:902/4000, Init:0.04s, 
Process:792.65s (9.05T/s), 
Generate:146.21s (6.17T/s), 
Total:938.86s

Note on the above: normally I run in debug mode to get the ms-per-token figures, but forgot to enable it this time. It comes out to about 110 ms per token for prompt processing and about 162 ms per token for generation.
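Those per-token figures can be recovered directly from the log line: prompt tokens processed = CtxLimit minus Amt, and generated tokens = Amt. A small helper (the field interpretation follows the log format shown above; that parsing is my assumption):

```python
# Recover ms/token and tokens/s from a timing line like the ones above.
# Prompt tokens processed = CtxLimit - Amt; generated tokens = Amt.

def per_token(ctx_used: int, generated: int,
              process_s: float, generate_s: float) -> dict:
    prompt_tokens = ctx_used - generated
    return {
        "prompt_ms_per_tok": 1000 * process_s / prompt_tokens,
        "prompt_tok_per_s": prompt_tokens / process_s,
        "gen_ms_per_tok": 1000 * generate_s / generated,
        "gen_tok_per_s": generated / generate_s,
    }

# The no-flash-attention run above: CtxLimit 8102, Amt 902
stats = per_token(8102, 902, 792.65, 146.21)
print(stats)  # ~110 ms/T prompt processing, ~162 ms/T generation
```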

M3 Ultra Mac Studio 512GB Deepseek V3 671b q4_K_M gguf with Flash Attention On

CtxLimit:7847/16384, 
Amt:647/4000, Init:0.04s, 
Process:793.14s (110.2ms/T = 9.08T/s), 
Generate:103.81s (160.5ms/T = 6.23T/s), 
Total:896.95s (0.72T/s)

For comparison, here is Llama 3.3 70b q8 with Flash Attention On

CtxLimit:6293/16384, 
Amt:222/800, Init:0.07s, 
Process:41.22s (8.2ms/T = 121.79T/s), 
Generate:35.71s (160.8ms/T = 6.22T/s), 
Total:76.92s (2.89T/s)
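Putting the runs side by side: the updated build's prompt processing (~64 T/s) is roughly 7x the original build's (~9 T/s), and while Llama 3.3 70b q8 still processes prompts about 13x faster than the original Deepseek numbers, it is now under 2x the updated ones. A quick check of those ratios from the figures above:

```python
# Prompt-processing speed ratios across the runs reported above (T/s).
speeds = {
    "Deepseek V3 q4_K_M (original build)": 9.05,
    "Deepseek V3 0324 q4_K_M (updated build)": 64.02,
    "Llama 3.3 70b q8": 121.79,
}
base = speeds["Deepseek V3 q4_K_M (original build)"]
for name, tps in speeds.items():
    print(f"{name}: {tps:.2f} T/s ({tps / base:.1f}x vs original build)")
```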