Here Are Some Real-World Speeds For the Mac M2 Ultra, In Case You Were Curious
One thing I see a lot when folks talk about how fast a machine/GPU/whatever is: they answer the question with a vague "I get 12 tokens per second!", which honestly doesn't give a clear answer by itself if you get no other info.
So, with that in mind, I wanted to give actual numbers for the M2 Ultra. If you're curious what using a Mac Studio for inference looks like, here you go.
Here is some info on the setup:
- This is an M2 Ultra Mac Studio with 192GB of RAM
- I used Oobabooga as the inference program
- Except for Miqu, which was only available as a q5, I used q8 for everything. I suspect there isn't a huge difference in speed between Miqu q5 and Llama 2 q8.
- These are GGUFs run via llama.cpp; Oobabooga uses the llama-cpp-python wrapper.
- These numbers are for the first message after a model load. NO "prefix match hit" for any of these.
- Times can vary a bit, so keep in mind that you're seeing one example of each context size + model size combination, and that the response token counts sometimes differ. In other words, if we re-ran the 120b @ 16k context 10 times, we might get a slightly higher or lower average time. This post is more of a "this is a general idea of what you're gonna get" than a hard "always expect this number".
- No, I probably won't redo this using your favorite program of choice. This wasn't fun to put together, and I regretted deciding to do it halfway through =D I just got stubborn and wanted to see it through to the end.
All of the tests from 120b down to 7b were done with the base setting of 147GB max VRAM. However, to run the 155b test, I ran the command "sudo sysctl iogpu.wired_limit_mb=170000" to increase my max VRAM to 170GB. That is the only model run while the system had this setting active. Not sure if it affects speed, but I wanted to point that out.
UPDATE: I added two q4 comparisons to the bottom. I use q8 because there is very little difference in speed between q4 and q8 on the Mac.
Also, for those asking if I'm using Metal, please see this comment.
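For a rough idea of what that looks like in code, here's a minimal llama-cpp-python sketch with full Metal offload. This is my own illustration, not the exact Oobabooga config, and the model filename is hypothetical.

```python
# Minimal sketch: run a GGUF through llama-cpp-python with all layers offloaded to
# the GPU (Metal on Apple Silicon builds of llama.cpp). Not the exact Oobabooga setup.
from llama_cpp import Llama

llm = Llama(
    model_path="miqu-1-70b.q5_K_M.gguf",  # hypothetical filename; any of the GGUFs tested below
    n_ctx=32768,       # context window; the largest Miqu 70b run below used ~32k tokens
    n_gpu_layers=-1,   # offload every layer to the GPU
    verbose=True,      # prints llama.cpp's timing lines, like the numbers quoted in this post
)

out = llm("<your long prompt here>", max_tokens=450)
print(out["choices"][0]["text"])
```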
TheProfessor 155b q8_0 @ 7,803 context / 399 token response:
- 1.84 ms per token sample
- 22.69 ms per token prompt eval
- 404.04 ms per token eval
- 1.18 tokens/sec
- 339.19 second response
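For anyone wondering how the per-token numbers relate to the totals: the response time is roughly context tokens × prompt eval time plus response tokens × (eval + sample) time. A quick sanity-check sketch using the 155b numbers above (my own arithmetic, not extra measurements):

```python
# Reconstruct total response time and overall tokens/sec from llama.cpp-style per-token timings.
def estimate(context_tokens, response_tokens, prompt_eval_ms, eval_ms, sample_ms):
    total_s = (context_tokens * prompt_eval_ms
               + response_tokens * (eval_ms + sample_ms)) / 1000.0
    return total_s, response_tokens / total_s

# 155b q8_0 @ 7,803 context / 399-token response (numbers from the block above)
total_s, tok_s = estimate(7803, 399, prompt_eval_ms=22.69, eval_ms=404.04, sample_ms=1.84)
print(f"~{total_s:.0f}s total, ~{tok_s:.2f} tokens/sec")  # ~339s, ~1.18 -- matches the run above
```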
TheProfessor 155b q8_0 @ 3,471 context / 400 token response:
- 1.80 ms per token sample
- 22.46 ms per token prompt eval
- 328.83 ms per token eval
- 1.90 tokens/sec
- 210.62 second response
Miqu-1-120b q8_0 @ 15,179 context / 450 token response:
- 1.76 ms per token sample
- 19.23 ms per token prompt eval
- 423.38 ms per token eval
- 0.91 tokens/sec
- 494.04 second response
Miqu-1-120b q8_0 @ 7,803 context / 399 token response:
- 1.81 ms per token sample
- 17.80 ms per token prompt eval
- 314.49 ms per token eval
- 1.50 tokens/sec
- 265.41 second response
Miqu-1-120b q8_0 @ 3,471 context / 433 token response:
- 1.75 ms per token sample
- 17.83 ms per token prompt eval
- 256.47 ms per token eval
- 2.48 tokens/sec
- 174.48 second response
Miqu 70b q5_K_M @ 32,302 context / 450 token response:
- 1.73 ms per token sample
- 16.42 ms per token prompt eval
- 384.97 ms per token eval
- 0.64 tokens/sec
- 705.03 second response
Miqu 70b q5_K_M @ 15,598 context / 415 token response:
- 1.01 ms per token sample
- 10.89 ms per token prompt eval
- 240.51 ms per token eval
- 1.49 tokens/sec
- 278.46 second response
Miqu 70b q5_K_M @ 7,703 context / 399 token response:
- 1.83 ms per token sample
- 12.33 ms per token prompt eval
- 175.78 ms per token eval
- 2.38 tokens/sec
- 167.57 second response
Miqu 70b q5_K_M @ 3,471 context / 415 token response:
- 1.79 ms per token sample
- 12.11 ms per token prompt eval
- 142.40 ms per token eval
- 4.05 tokens/sec
- 102.47 second response
Yi 34b 200k q8_0 @ 52,353 context / 415 token response:
- 3.49 ms per token sample
- 11.59 ms per token prompt eval
- 370.55 ms per token eval
- 0.54 tokens/sec
- 763.27 second response
Yi 34b 200k q8_0 @ 30,991 context / 415 token response:
- 3.55 ms per token sample
- 7.74 ms per token prompt eval
- 238.55 ms per token eval
- 1.21 tokens/sec
- 341.61 second response
Yi 34b 200k q8_0 @ 14,866 context / 400 token response:
- 2.22 ms per token sample
- 5.69 ms per token prompt eval
- 142.81 ms per token eval
- 2.71 tokens/sec
- 147.63 second response
Yi 34b 200k q8_0 @ 3,967 context / 393 token response:
- 3.50 ms per token sample
- 5.01 ms per token prompt eval
- 84.86 ms per token eval
- 7.06 tokens/sec
- 55.63 second response
Llama 2 13b q8_0 @ 7,748 context / 441 token response:
- 1.81 ms per token sample
- 2.13 ms per token prompt eval
- 49.54 ms per token eval
- 11.03 tokens/sec
- 39.97 second response
Llama 2 13b q8_0 @ 3,584 context / 412 token response:
- 0.10 ms per token sample
- 2.00 ms per token prompt eval
- 38.04 ms per token eval
- 16.01 tokens/sec
- 31.98 second response
Mistral 7b q8_0 @ 30,852 context / 415 token response:
- 1.77 ms per token sample
- 1.99 ms per token prompt eval
- 68.31 ms per token eval
- 4.53 tokens/sec
- 91.55 second response
Mistral 7b q8_0 @ 15,241 context / 415 token response:
- 1.82 ms per token sample
- 1.41 ms per token prompt eval
- 42.32 ms per token eval
- 10.21 tokens/sec
- 40.65 second response
Mistral 7b q8_0 @ 7,222 context / 415 token response:
- 1.81 ms per token sample
- 1.21 ms per token prompt eval
- 29.05 ms per token eval
- 18.62 tokens/sec
- 22.29 second response
Mistral 7b q8_0 @ 3,291 context / 415 token response:
- 1.78 ms per token sample
- 1.15 ms per token prompt eval
- 22.52 ms per token eval
- 28.47 tokens/sec
- 14.58 second response
EDIT: Re-ran some of the smaller responses to bring them closer to 400-500 tokens, as they made the numbers look weird. Also re-ran the 55k Yi 34b, as something wasn't right about it.
EDIT 2: In case anyone was curious, here are some q4 number comparisons.
120b
- Miqu-1-120b q8_0 @ 15,179 context / 450 response: 0.91 tokens/s, 494.04 second response
- Miqu-1-120b q4_K_M @ 15,798 context / 450 response: 0.89 tokens/s, 503.75 second response
34b
- Yi 34b 200k q8_0 @ 14,866 context / 400 response: 2.71 tokens/s, 147.63 second response
- Yi 34b 200k q4_K_M @ 14,783 context / 403 response: 2.74 tokens/s, 147.13 second response
Q4_K_M test full numbers
Miqu-1-120b q4_K_M @ 15,798 context / 450 token response:
- 1.62 ms per token sample
- 21.49 ms per token prompt eval
- 362.53 ms per token eval
- 0.89 tokens/sec
- 503.75 second response
Yi 34b 200k q4_K_M @ 14,783 context / 403 token response:
- 3.39 ms per token sample
- 6.38 ms per token prompt eval
- 125.88 ms per token eval
- 2.74 tokens/sec
- 147.13 second response
EDIT 3: I loaded up Koboldcpp and made use of context shifting. Here is what real-world numbers look like there, using a 120b q4 at 16k and a 70b q8 at 16k (rope-scaled).
70b q8 @ 16k using Koboldcpp ContextShifting
Processing Prompt [BLAS] (14940 / 14940 tokens)
Generating (354 / 400 tokens)
(EOS token triggered!)
CtxLimit: 16042/16384, Process:163.17s (10.9ms/T = 91.56T/s), Generate:101.49s (286.7ms/T = 3.49T/s), Total:264.66s (1.34T/s)
[Context Shifting: Erased 406 tokens at position 773]
Processing Prompt [BLAS] (409 / 409 tokens)
Generating (400 / 400 tokens)
CtxLimit: 16069/16384, Process:8.38s (20.5ms/T = 48.84T/s), Generate:115.54s (288.9ms/T = 3.46T/s), Total:123.92s (3.23T/s)
[Context Shifting: Erased 848 tokens at position 773]
Processing Prompt [BLAS] (421 / 421 tokens)
Generating (271 / 400 tokens)
CtxLimit: 15491/16384, Process:8.66s (20.6ms/T = 48.60T/s), Generate:78.16s (288.4ms/T = 3.47T/s), Total:86.82s (3.12T/s)
120b q4 @ 16k using Koboldcpp ContextShifting
Processing Prompt [BLAS] (15220 / 15220 tokens)
Generating (374 / 400 tokens)
(EOS token triggered!)
CtxLimit: 15594/16384, Process:319.71s (21.0ms/T = 47.61T/s), Generate:148.74s (397.7ms/T = 2.51T/s), Total:468.44s (0.80T/s)
Processing Prompt [BLAS] (464 / 464 tokens)
Generating (321 / 400 tokens)
(EOS token triggered!)
CtxLimit: 15983/16384, Process:14.87s (32.1ms/T = 31.20T/s), Generate:128.96s (401.8ms/T = 2.49T/s), Total:143.84s (2.23T/s)
[Context Shifting: Erased 721 tokens at position 780]
Processing Prompt [BLAS] (387 / 387 tokens)
Generating (394 / 400 tokens)
(EOS token triggered!)
CtxLimit: 15700/16384, Process:13.32s (34.4ms/T = 29.06T/s), Generate:158.31s (401.8ms/T = 2.49T/s), Total:171.62s (2.30T/s)
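To put those context-shifting numbers in perspective, here's some back-of-envelope arithmetic (mine, derived from the 70b q8 log above) comparing a follow-up turn with and without shifting:

```python
# Why context shifting matters at 16k: without it, every follow-up turn would reprocess
# the whole ~15k-token history instead of just the few hundred new tokens.
def turn_seconds(prompt_tokens, gen_tokens, process_ms, gen_ms=288.0):
    return (prompt_tokens * process_ms + gen_tokens * gen_ms) / 1000.0

# Full reprocess (first turn, or every turn without shifting): ~10.9 ms/T at large batch
no_shift = turn_seconds(14940, 400, process_ms=10.9)
# With shifting, only the ~400 new tokens get processed (~20.5 ms/T at small batch)
with_shift = turn_seconds(409, 400, process_ms=20.5)
print(f"~{no_shift:.0f}s vs ~{with_shift:.0f}s per follow-up turn")  # ~278s vs ~124s, matching the log
```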