Mac Speed Comparison: M2 Ultra vs M3 Ultra using KoboldCpp
tl;dr: Running GGUFs in KoboldCpp, the M3 Ultra is marginally... slower? Prompt processing is slightly faster, but token generation is slower across all models. I added a comparison Llama.cpp run at the bottom; it's the same speed as Kobold, give or take.
Setup:
- Inference engine: Koboldcpp 1.85.1
- Text: Same text on ALL models. Token count differences are due to tokenizer differences
- Temp: 0.01; all other samplers disabled
Computers:
- M3 Ultra:
  - 512GB RAM
  - 80 GPU Cores
- M2 Ultra:
  - 192GB RAM
  - 76 GPU Cores

Notes:
- Qwen2.5 Coder and Llama 3.1 8b are more sensitive to temp than Llama 3.3 70b
- All inference was first prompt after model load
- All models are q8, as on Mac q8 is the fastest gguf quant (see my previous posts on Mac speeds)
Llama 3.1 8b q8
M2 Ultra:
CtxLimit:12433/32768,
Amt:386/4000, Init:0.02s,
Process:13.56s (1.1ms/T = 888.55T/s),
Generate:14.41s (37.3ms/T = 26.79T/s),
Total:27.96s (13.80T/s)
M3 Ultra:
CtxLimit:12408/32768,
Amt:361/4000, Init:0.01s,
Process:12.05s (1.0ms/T = 999.75T/s),
Generate:13.62s (37.7ms/T = 26.50T/s),
Total:25.67s (14.06T/s)
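For anyone who wants to sanity-check these KoboldCpp lines, here is a minimal Python sketch. It is not part of KoboldCpp, and `parse_kobold_line` is a hypothetical helper name. It assumes that CtxLimit counts prompt tokens plus generated tokens, so prompt tokens = CtxLimit - Amt; that is an inference from the numbers, not something taken from documentation.

```python
import re

# M2 Ultra, Llama 3.1 8b q8, joined onto one line from the log above.
LOG_LINE = (
    "CtxLimit:12433/32768, Amt:386/4000, Init:0.02s, "
    "Process:13.56s (1.1ms/T = 888.55T/s), "
    "Generate:14.41s (37.3ms/T = 26.79T/s), "
    "Total:27.96s (13.80T/s)"
)


def parse_kobold_line(line):
    """Hypothetical helper: recompute T/s from the raw counts and timings."""
    def grab(pattern, cast):
        return cast(re.search(pattern, line).group(1))

    ctx_used = grab(r"CtxLimit:(\d+)/", int)
    generated = grab(r"Amt:(\d+)/", int)
    process_s = grab(r"Process:([\d.]+)s", float)
    generate_s = grab(r"Generate:([\d.]+)s", float)

    # Assumption: CtxLimit = prompt tokens + generated tokens.
    prompt_tokens = ctx_used - generated
    return {
        "prompt_tokens": prompt_tokens,
        "process_tps": prompt_tokens / process_s,
        "generate_tps": generated / generate_s,
    }


stats = parse_kobold_line(LOG_LINE)
print(stats)
```

The recomputed rates should land within rounding of the logged 888.55T/s and 26.79T/s, since the logged seconds are rounded to two decimals.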
Mistral Small 24b q8
M2 Ultra:
CtxLimit:13300/32768,
Amt:661/4000, Init:0.07s,
Process:34.86s (2.8ms/T = 362.50T/s),
Generate:45.43s (68.7ms/T = 14.55T/s),
Total:80.29s (8.23T/s)
M3 Ultra:
CtxLimit:13300/32768,
Amt:661/4000, Init:0.04s,
Process:31.97s (2.5ms/T = 395.28T/s),
Generate:46.27s (70.0ms/T = 14.29T/s),
Total:78.24s (8.45T/s)
Qwen2.5 32b Coder q8 with 1.5b speculative decoding
M2 Ultra:
CtxLimit:13215/32768,
Amt:473/4000, Init:0.06s,
Process:59.38s (4.7ms/T = 214.59T/s),
Generate:34.70s (73.4ms/T = 13.63T/s),
Total:94.08s (5.03T/s)
M3 Ultra:
CtxLimit:13271/32768,
Amt:529/4000, Init:0.05s,
Process:52.97s (4.2ms/T = 240.56T/s),
Generate:43.58s (82.4ms/T = 12.14T/s),
Total:96.55s (5.48T/s)
Qwen2.5 32b Coder q8 WITHOUT speculative decoding
M2 Ultra:
CtxLimit:13315/32768,
Amt:573/4000, Init:0.07s,
Process:53.44s (4.2ms/T = 238.42T/s),
Generate:64.77s (113.0ms/T = 8.85T/s),
Total:118.21s (4.85T/s)
M3 Ultra:
CtxLimit:13285/32768,
Amt:543/4000, Init:0.04s,
Process:49.35s (3.9ms/T = 258.22T/s),
Generate:62.51s (115.1ms/T = 8.69T/s),
Total:111.85s (4.85T/s)
Llama 3.3 70b q8 with 3b speculative decoding
M2 Ultra:
CtxLimit:12519/32768,
Amt:472/4000, Init:0.04s,
Process:116.18s (9.6ms/T = 103.69T/s),
Generate:54.99s (116.5ms/T = 8.58T/s),
Total:171.18s (2.76T/s)
M3 Ultra:
CtxLimit:12519/32768,
Amt:472/4000, Init:0.02s,
Process:103.12s (8.6ms/T = 116.77T/s),
Generate:63.74s (135.0ms/T = 7.40T/s),
Total:166.86s (2.83T/s)
Llama 3.3 70b q8 WITHOUT speculative decoding
M2 Ultra:
CtxLimit:12519/32768,
Amt:472/4000, Init:0.03s,
Process:104.74s (8.7ms/T = 115.01T/s),
Generate:98.15s (207.9ms/T = 4.81T/s),
Total:202.89s (2.33T/s)
M3 Ultra:
CtxLimit:12519/32768,
Amt:472/4000, Init:0.01s,
Process:96.67s (8.0ms/T = 124.62T/s),
Generate:103.09s (218.4ms/T = 4.58T/s),
Total:199.76s (2.36T/s)
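To put the KoboldCpp runs above side by side, here is a rough Python sketch that computes the M3 Ultra's percentage change versus the M2 Ultra for each model. The T/s values are copied from the logs above; the `pct_change` helper and the table layout are illustrative choices only. Keep in mind that some runs generated different numbers of tokens (Amt differs), so the generation comparison is not perfectly apples to apples.

```python
# (model, M2 process T/s, M3 process T/s, M2 generate T/s, M3 generate T/s)
RESULTS = [
    ("Llama 3.1 8b q8", 888.55, 999.75, 26.79, 26.50),
    ("Mistral Small 24b q8", 362.50, 395.28, 14.55, 14.29),
    ("Qwen2.5 32b Coder q8 (spec. decoding)", 214.59, 240.56, 13.63, 12.14),
    ("Qwen2.5 32b Coder q8 (no spec. decoding)", 238.42, 258.22, 8.85, 8.69),
    ("Llama 3.3 70b q8 (spec. decoding)", 103.69, 116.77, 8.58, 7.40),
    ("Llama 3.3 70b q8 (no spec. decoding)", 115.01, 124.62, 4.81, 4.58),
]


def pct_change(m2, m3):
    """Percent change of the M3 Ultra relative to the M2 Ultra."""
    return (m3 - m2) / m2 * 100.0


deltas = [
    (name, pct_change(p2, p3), pct_change(g2, g3))
    for name, p2, p3, g2, g3 in RESULTS
]

for name, process_delta, generate_delta in deltas:
    print(f"{name:42s} process {process_delta:+6.1f}%  generate {generate_delta:+6.1f}%")
```

Every row comes out positive for prompt processing and negative for generation, which matches the tl;dr. The largest generation gaps are on the two speculative decoding runs.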
Llama.cpp Server Comparison Run :: Llama 3.3 70b q8 WITHOUT Speculative Decoding
M2 Ultra:
prompt eval time = 105195.24 ms / 12051 tokens (8.73 ms per token, 114.56 tokens per second)
eval time = 78102.11 ms / 377 tokens (207.17 ms per token, 4.83 tokens per second)
total time = 183297.35 ms / 12428 tokens
M3 Ultra:
prompt eval time = 96696.48 ms / 12051 tokens (8.02 ms per token, 124.63 tokens per second)
eval time = 82026.89 ms / 377 tokens (217.58 ms per token, 4.60 tokens per second)
total time = 178723.36 ms / 12428 tokens
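As a quick check on the "same speed as Kobold, give or take" claim, this sketch recomputes tokens per second from the Llama.cpp timing lines and compares them with the KoboldCpp numbers for the same model (Llama 3.3 70b q8 without speculative decoding). The `tokens_per_second` helper and the dictionary layout are hypothetical; the figures are copied from the logs in this post.

```python
import re


def tokens_per_second(timing_line):
    """Recompute T/s from a '<ms> ms / <n> tokens' timing line."""
    match = re.search(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens", timing_line)
    milliseconds, tokens = float(match.group(1)), int(match.group(2))
    return tokens / (milliseconds / 1000.0)


LLAMA_CPP = {
    "M2 Ultra": {
        "prompt": "prompt eval time = 105195.24 ms / 12051 tokens",
        "eval": "eval time = 78102.11 ms / 377 tokens",
    },
    "M3 Ultra": {
        "prompt": "prompt eval time = 96696.48 ms / 12051 tokens",
        "eval": "eval time = 82026.89 ms / 377 tokens",
    },
}

# KoboldCpp T/s for the same model, taken from the run above.
KOBOLD = {
    "M2 Ultra": {"prompt": 115.01, "eval": 4.81},
    "M3 Ultra": {"prompt": 124.62, "eval": 4.58},
}

gaps = {}
for machine, lines in LLAMA_CPP.items():
    for phase, line in lines.items():
        tps = tokens_per_second(line)
        kobold_tps = KOBOLD[machine][phase]
        gaps[(machine, phase)] = (tps - kobold_tps) / kobold_tps * 100.0
        print(f"{machine} {phase}: llama.cpp {tps:.2f} T/s vs Kobold {kobold_tps:.2f} T/s")
```

With these numbers, each gap between the two engines is well under one percent.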