Benchmarks - SomeOddCodeGuy's Ramblings

Benchmarks

Mac Studio M3 Ultra Speeds for Qwen3 235b, GPT-OSS-120b, GLM 4.5, and Deepseek V3.1

M3 Ultra Mac Studio 512GB Speeds

Qwen3 235b a22b Instruct Q8 in Llama.cpp server (~15k tokens)
prompt eval time: 4.60 ms per token, 217.29 tokens per second
eval time: 67.59 ms per token, 14.80 tokens per second
total time: 146863.82 ms / 15763 tokens

(~5k
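For readers comparing these numbers with other posts, the two figures on each llama.cpp timing line are reciprocals of one another. A minimal sketch of the conversion follows, using the Qwen3 235b figures quoted above. The function name `tokens_per_second` is made up for illustration and is not part of llama.cpp. Small differences from the logged values are expected because the log rounds the per-token time.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


# Figures taken from the Qwen3 235b excerpt above.
prompt_tps = tokens_per_second(4.60)   # prompt processing
write_tps = tokens_per_second(67.59)   # token generation

print(f"prompt processing: {prompt_tps:.2f} tokens/s")
print(f"generation:        {write_tps:.2f} tokens/s")
```

The generation figure reproduces the logged 14.80 tokens per second after rounding. The prompt figure lands about 0.1 tokens per second away from the logged 217.29, which is consistent with the per-token time having been rounded to 4.60 ms.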


Running Deepseek R1 0528 q4_K_M and mlx 4-bit on a Mac Studio M3

Mac Model: M3 Ultra Mac Studio 512GB, 80 core GPU

First: this model has a shockingly small KV Cache. If any of you saw my post about running Deepseek V3 q4_K_M, you'd have seen that the KV cache buffer in llama.cpp/koboldcpp was 157GB for


M3 Ultra Mac Studio 512GB prompt and write speeds for Deepseek V3 0 671b gguf q4_K_M, for those curious

UPDATE 2025-04-13: llama.cpp has had an update that GREATLY improved the prompt processing speed. Please see the new speeds below.

Deepseek V3 0324 Q4_K_M w/Flash Attention
4800 token context, responding 552 tokens
CtxLimit:4744/8192, Amt:552/4000, Init:0.07s, Process:65.46s (64.02T/


Running Llama 3.1 405b q6 and Command-A 111b Q8 on M3 Ultra Mac Studio

Below are benchmarks of running Llama 3.1 405b q6 and Command A 111b Q8 on an M3 Ultra 512GB using KoboldCpp. The 405b was so miserable to run that I didn't even try flash attention, and flash attention was completely broken with Command-A.

M3 Ultra Llama 3.


Mac Speed Comparison: M2 Ultra vs M3 Ultra using KoboldCpp

tl;dr: Running ggufs in Koboldcpp, the M3 is marginally... slower? Slightly faster prompt processing, but slower prompt writing across all models. I added a comparison Llama.cpp run at the bottom; same speed as Kobold, give or take.

Setup:
* Inference engine: Koboldcpp 1.85.1
* Text: Same text on


Low Context Speed Comparison: Macbook, Mac Studios, and RTX 4090

It's been a while since my last Mac speed post, so I figured it was about time to post a new one. I've noticed a lot of the old "I get 500 tokens per second!" kind of talk re-appearing, so I figured some cold-hard


Real World Speeds on the Mac: Koboldcpp Context Shift Edition!

Previous Post: Here are some real-world speeds for the Mac M2

Introduction

In my previous post, I showed the raw real-world numbers of what non-cached response times would look like for a Mac Studio M2 Ultra. My goal was to demonstrate how well the machine really handles models at full