Benchmarks

Mac Studio M3 Ultra Speeds for Qwen3 235b, GPT-OSS-120b, GLM 4.5, and Deepseek V3.1

M3 Ultra Mac Studio 512GB Speeds

Qwen3 235b a22b Instruct Q8 in Llama.cpp server

(~15k tokens)

prompt eval time   
    4.60 ms per token, 
    217.29 tokens per second

eval time   
    67.59 ms per token, 
    14.80 tokens per second

total time 
    146863.82 ms / 15763 tokens

(~5k tokens)

prompt eval time   
    4.90 ms per token, 
    204.24 tokens per second

eval time   
    57.18 ms per token, 
    17.49 tokens per second

total time 
    65510.45 ms / 5649 tokens
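
A quick sanity check on how to read these numbers: the two figures in each block are reciprocals of one another (tokens per second is 1000 divided by ms per token). The sketch below re-derives the throughput for the two Qwen3 235b runs above. The function name and the dictionary labels are made up for illustration, and small mismatches are expected because the reported ms-per-token values are already rounded.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    return 1000.0 / ms_per_token


# (ms per token, reported tokens per second) copied from the Qwen3 235b runs above.
# The labels are illustrative only.
reported = {
    "~15k prompt eval": (4.60, 217.29),
    "~15k eval": (67.59, 14.80),
    "~5k prompt eval": (4.90, 204.24),
    "~5k eval": (57.18, 17.49),
}

for label, (ms, tps) in reported.items():
    derived = tokens_per_second(ms)
    print(f"{label}: derived {derived:.2f} t/s, reported {tps:.2f} t/s")
```

The derived and reported values should agree to within a fraction of a percent; anything further off would suggest a copy error rather than a rounding difference.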

GPT-OSS-120b Unsloth fp16 gguf in Llama.cpp server

(~5k tokens)

prompt eval time   
    1.37 ms per token, 
    732.57 tokens per second

eval time   
    15.90 ms per token, 
    62.90 tokens per second

total time 
    8526.55 ms / 4447 tokens

GLM 4.5 Q8 in Llama.cpp server

(~20k tokens)

prompt eval time   
    7.26 ms per token, 
    137.82 tokens per second

eval time   
    103.33 ms per token, 
    9.68 tokens per second

total time 
    202089.84 ms / 21377 tokens

(~15k tokens)

prompt eval time   
    7.16 ms per token, 
    139.64 tokens per second

eval time   
    96.64 ms per token, 
    10.35 tokens per second

total time 
    200516.47 ms / 16505 tokens

(~10k tokens)

prompt eval time   
    6.64 ms per token, 
    150.55 tokens per second

eval time   
    88.75 ms per token, 
    11.27 tokens per second

total time 
    108213.31 ms / 10927 tokens

(~5k tokens)

prompt eval time   
    6.86 ms per token, 
    145.70 tokens per second

eval time   
    81.31 ms per token, 
    12.30 tokens per second

total time 
    64483.49 ms / 6000 tokens

Deepseek V3.1 Q5_K_M in Llama.cpp server

(~13k tokens)

prompt eval time   
    14.22 ms per token, 
    70.30 tokens per second

eval time   
    264.86 ms per token, 
    3.78 tokens per second

total time 
    253415.56 ms / 13217 tokens

(~5k tokens)

prompt eval time   
    9.68 ms per token, 
    103.30 tokens per second

eval time   
    144.04 ms per token, 
    6.94 tokens per second

total time 
    119343.67 ms / 5763 tokens

(~3k tokens)

prompt eval time   
    11.92 ms per token, 
    83.86 tokens per second

eval time   
    107.64 ms per token, 
    9.29 tokens per second

total time 
    65396.57 ms / 3269 tokens
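
The logs above only give a combined token count per run, but the prompt/response split can be roughly estimated from the timings. This is a minimal sketch, assuming the "total time" line is the sum of the prompt eval and eval phases and that its token count is prompt tokens plus generated tokens. Under that assumption there are two equations and two unknowns. The `Run` type and its field names are hypothetical, and the results are approximate because the per-token figures are rounded.

```python
from typing import NamedTuple


class Run(NamedTuple):
    label: str
    prompt_ms: float    # prompt eval, ms per token
    gen_ms: float       # eval (generation), ms per token
    total_ms: float     # total time in ms
    total_tokens: int   # prompt + generated tokens (assumed)


def estimate_split(run: Run) -> tuple[int, int]:
    """Solve P + G = total_tokens and P*prompt_ms + G*gen_ms = total_ms.

    Returns (estimated prompt tokens, estimated generated tokens).
    """
    generated = round(
        (run.total_ms - run.prompt_ms * run.total_tokens)
        / (run.gen_ms - run.prompt_ms)
    )
    return run.total_tokens - generated, generated


# A few of the runs above, copied from the logs.
runs = [
    Run("Qwen3 235b Q8 (~15k)", 4.60, 67.59, 146863.82, 15763),
    Run("GPT-OSS-120b fp16 (~5k)", 1.37, 15.90, 8526.55, 4447),
    Run("Deepseek V3.1 Q5_K_M (~13k)", 14.22, 264.86, 253415.56, 13217),
]

for run in runs:
    prompt, generated = estimate_split(run)
    print(f"{run.label}: ~{prompt} prompt tokens, ~{generated} generated tokens")
```

Treat the output as a ballpark only: it is useful for seeing that most of the tokens in each run are prompt, while most of the wall-clock time on the larger models goes to generation.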

If You Have the Hardware- Use it to Learn!

If you've never messed with open-source LLMs and you jumped on the ClawdBot/OpenClaw hype train, take some time to learn more about how local models work. You likely went through the trouble of getting a Mac Mini, so you now have a nice little test box

An Analogy to Help Understand Mixture of Experts

If you're having a hard time understanding MoE strength vs dense models, and roughly where they might land when comparing them, think about this super oversimplified analogy. I'm hoping it makes sense:

The Scenario

Imagine a paid trivia competition, but all the questions are about carpentry

I Won't Miss The Cold...

This has nothing to do with technology, but just so you know: I'm a tropical beastie, and I absolutely will not miss the 22-degree weather this past weekend. I am no longer built for this. That is all.

My Personal Guide for Developing Software with AI Assistance - 2026 Edition

What's Changed Since 2024

So back in May of 2024 I wrote the first version of this little guide, at a time when agents were absolute crap and Wilmer was still in a state that couldn't even be called v0.01. Back then it got a
