Here Are Some Real-World Speeds For the Mac M2 Ultra, In Case You Were Curious
One thing I see a lot when folks talk about how fast a machine/GPU/whatever is: they answer the question with a vague "I get 12 tokens per second!", which honestly doesn't give a clear answer by itself if you get no other info.
So, with that in mind, I wanted to give actual numbers for the M2 Ultra. If you're curious what using a Mac Studio for inference looks like, here you go.
Here is some info on the setup:
- This is an M2 Ultra Mac Studio with 192GB of RAM
- I used Oobabooga as the inference program
- Except for Miqu, which was only available as a q5, I used q8 for everything. I suspect there isn't a huge difference in speed between Miqu q5 and Llama 2 q8.
- These are GGUFs run via llama.cpp; Oobabooga uses the llama-cpp-python wrapper.
- These numbers are for the first message after a model load. NO "prefix match hit" for any of these.
- Times can vary a bit, so keep in mind that you're seeing one example of each context size + model size combination, and that the response token counts sometimes differ. In other words, if we re-ran the 120b @ 16k context 10 times, we might get a slightly higher or lower average time. This post is more of a "this is a general idea of what you're gonna get" than a hard "always expect this number".
- No, I probably won't redo this using your favorite program of choice. This wasn't fun to put together, and I regretted deciding to do it halfway through =D I just got stubborn and wanted to see it through to the end.
All of the tests from 120b down to 7b were done with the base setting of 147GB max VRAM. However, to run the 155b test, I ran the command "sudo sysctl iogpu.wired_limit_mb=170000" to increase my max VRAM to 170GB. That is the only model run while the system had this setting active. Not sure if it affects speed, but I wanted to point that out.
UPDATE: I added two q4 comparisons to the bottom. I use q8 because there is very little difference in speed between q4 and q8 on the Mac.
Also, for those asking if I'm using Metal, please see this comment.
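For a rough idea of what that looks like in code, here's a minimal llama-cpp-python sketch with full Metal offload. This is my own illustration, not the exact Oobabooga config, and the model filename is hypothetical.

```python
# Minimal sketch: run a GGUF through llama-cpp-python with all layers offloaded to
# the GPU (Metal on Apple Silicon builds of llama.cpp). Not the exact Oobabooga setup.
from llama_cpp import Llama

llm = Llama(
    model_path="miqu-1-70b.q5_K_M.gguf",  # hypothetical filename; any of the GGUFs tested below
    n_ctx=32768,       # context window; the largest Miqu 70b run below used ~32k tokens
    n_gpu_layers=-1,   # offload every layer to the GPU
    verbose=True,      # prints llama.cpp's timing lines, like the numbers quoted in this post
)

out = llm("<your long prompt here>", max_tokens=450)
print(out["choices"][0]["text"])
```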
TheProfessor 155b q8_0 @ 7,803 context / 399 token response:
- 1.84 ms per token sample
- 22.69 ms per token prompt eval
- 404.04 ms per token eval
- 1.18 tokens/sec
- 339.19 second response
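For anyone wondering how the per-token numbers relate to the totals: the response time is roughly context tokens × prompt eval time plus response tokens × (eval + sample) time. A quick sanity-check sketch using the 155b numbers above (my own arithmetic, not extra measurements):

```python
# Reconstruct total response time and overall tokens/sec from llama.cpp-style per-token timings.
def estimate(context_tokens, response_tokens, prompt_eval_ms, eval_ms, sample_ms):
    total_s = (context_tokens * prompt_eval_ms
               + response_tokens * (eval_ms + sample_ms)) / 1000.0
    return total_s, response_tokens / total_s

# 155b q8_0 @ 7,803 context / 399-token response (numbers from the block above)
total_s, tok_s = estimate(7803, 399, prompt_eval_ms=22.69, eval_ms=404.04, sample_ms=1.84)
print(f"~{total_s:.0f}s total, ~{tok_s:.2f} tokens/sec")  # ~339s, ~1.18 -- matches the run above
```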
TheProfessor 155b q8_0 @ 3,471 context / 400 token response:
- 1.80 ms per token sample
- 22.46 ms per token prompt eval
- 328.83 ms per token eval
- 1.90 tokens/sec
- 210.62 second response
Miqu-1-120b q8_0 @ 15,179 context / 450 token response:
- 1.76 ms per token sample
- 19.23 ms per token prompt eval
- 423.38 ms per token eval
- 0.91 tokens/sec
- 494.04 second response
Miqu-1-120b q8_0 @ 7,803 context / 399 token response:
- 1.81 ms per token sample
- 17.80 ms per token prompt eval
- 314.49 ms per token eval
- 1.50 tokens/sec
- 265.41 second response
Miqu-1-120b q8_0 @ 3,471 context / 433 token response:
- 1.75 ms per token sample
- 17.83 ms per token prompt eval
- 256.47 ms per token eval
- 2.48 tokens/sec
- 174.48 second response
Miqu 70b q5_K_M @ 32,302 context / 450 token response:
- 1.73 ms per token sample
- 16.42 ms per token prompt eval
- 384.97 ms per token eval
- 0.64 tokens/sec
- 705.03 second response
Miqu 70b q5_K_M @ 15,598 context / 415 token response:
- 1.01 ms per token sample
- 10.89 ms per token prompt eval
- 240.51 ms per token eval
- 1.49 tokens/sec
- 278.46 second response
Miqu 70b q5_K_M @ 7,703 context / 399 token response:
- 1.83 ms per token sample
- 12.33 ms per token prompt eval
- 175.78 ms per token eval
- 2.38 tokens/sec
- 167.57 second response
Miqu 70b q5_K_M @ 3,471 context / 415 token response:
- 1.79 ms per token sample
- 12.11 ms per token prompt eval
- 142.40 ms per token eval
- 4.05 tokens/sec
- 102.47 second response
Yi 34b 200k q8_0 @ 52,353 context / 415 token response:
- 3.49 ms per token sample
- 11.59 ms per token prompt eval
- 370.55 ms per token eval
- 0.54 tokens/sec
- 763.27 second response
Yi 34b 200k q8_0 @ 30,991 context / 415 token response:
- 3.55 ms per token sample
- 7.74 ms per token prompt eval
- 238.55 ms per token eval
- 1.21 tokens/sec
- 341.61 second response
Yi 34b 200k q8_0 @ 14,866 context / 400 token response:
- 2.22 ms per token sample
- 5.69 ms per token prompt eval
- 142.81 ms per token eval
- 2.71 tokens/sec
- 147.63 second response
Yi 34b 200k q8_0 @ 3,967 context / 393 token response:
- 3.50 ms per token sample
- 5.01 ms per token prompt eval
- 84.86 ms per token eval
- 7.06 tokens/sec
- 55.63 second response
Llama 2 13b q8_0 @ 7,748 context / 441 token response:
- 1.81 ms per token sample
- 2.13 ms per token prompt eval
- 49.54 ms per token eval
- 11.03 tokens/sec
- 39.97 second response
Llama 2 13b q8_0 @ 3,584 context / 412 token response:
- 0.10 ms per token sample
- 2.00 ms per token prompt eval
- 38.04 ms per token eval
- 16.01 tokens/sec
- 31.98 second response
Mistral 7b q8_0 @ 30,852 context / 415 token response:
- 1.77 ms per token sample
- 1.99 ms per token prompt eval
- 68.31 ms per token eval
- 4.53 tokens/sec
- 91.55 second response
Mistral 7b q8_0 @ 15,241 context / 415 token response:
- 1.82 ms per token sample
- 1.41 ms per token prompt eval
- 42.32 ms per token eval
- 10.21 tokens/sec
- 40.65 second response
Mistral 7b q8_0 @ 7,222 context / 415 token response:
- 1.81 ms per token sample
- 1.21 ms per token prompt eval
- 29.05 ms per token eval
- 18.62 tokens/sec
- 22.29 second response
Mistral 7b q8_0 @ 3,291 context / 415 token response:
- 1.78 ms per token sample
- 1.15 ms per token prompt eval
- 22.52 ms per token eval
- 28.47 tokens/sec
- 14.58 second response
EDIT: Re-ran some of the smaller responses to bring them closer to 400-500 tokens, as they made the numbers look weird. Also re-ran the 55k Yi 34b, as something wasn't right about it.
EDIT 2: In case anyone was curious, here are some q4 number comparisons.
120b
- Miqu-1-120b q8_0 @ 15,179 context / 450 response: 0.91 tokens/s, 494.04 second response
- Miqu-1-120b q4_K_M @ 15,798 context / 450 response: 0.89 tokens/s, 503.75 second response
34b
- Yi 34b 200k q8_0 @ 14,866 context / 400 response: 2.71 tokens/s, 147.63 second response
- Yi 34b 200k q4_K_M @ 14,783 context / 403 response: 2.74 tokens/s, 147.13 second response
Q4_K_M test full numbers
Miqu-1-120b q4_K_M @ 15,798 context / 450 token response:
- 1.62 ms per token sample
- 21.49 ms per token prompt eval
- 362.53 ms per token eval
- 0.89 tokens/sec
- 503.75 second response
Yi 34b 200k q4_K_M @ 14,783 context / 403 token response:
- 3.39 ms per token sample
- 6.38 ms per token prompt eval
- 125.88 ms per token eval
- 2.74 tokens/sec
- 147.13 second response
EDIT 3: I loaded up Koboldcpp and made use of context shifting. Here is what real-world numbers look like there, using a 120b q4 at 16k and a 70b q8 at 16k (rope-scaled).
70b q8 @ 16k using Koboldcpp ContextShifting
Processing Prompt [BLAS] (14940 / 14940 tokens)
Generating (354 / 400 tokens)
(EOS token triggered!)
CtxLimit: 16042/16384, Process:163.17s (10.9ms/T = 91.56T/s), Generate:101.49s (286.7ms/T = 3.49T/s), Total:264.66s (1.34T/s)
[Context Shifting: Erased 406 tokens at position 773]
Processing Prompt [BLAS] (409 / 409 tokens)
Generating (400 / 400 tokens)
CtxLimit: 16069/16384, Process:8.38s (20.5ms/T = 48.84T/s), Generate:115.54s (288.9ms/T = 3.46T/s), Total:123.92s (3.23T/s)
[Context Shifting: Erased 848 tokens at position 773]
Processing Prompt [BLAS] (421 / 421 tokens)
Generating (271 / 400 tokens)
CtxLimit: 15491/16384, Process:8.66s (20.6ms/T = 48.60T/s), Generate:78.16s (288.4ms/T = 3.47T/s), Total:86.82s (3.12T/s)
120b q4 @ 16k using Koboldcpp ContextShifting
Processing Prompt [BLAS] (15220 / 15220 tokens)
Generating (374 / 400 tokens)
(EOS token triggered!)
CtxLimit: 15594/16384, Process:319.71s (21.0ms/T = 47.61T/s), Generate:148.74s (397.7ms/T = 2.51T/s), Total:468.44s (0.80T/s)
Processing Prompt [BLAS] (464 / 464 tokens)
Generating (321 / 400 tokens)
(EOS token triggered!)
CtxLimit: 15983/16384, Process:14.87s (32.1ms/T = 31.20T/s), Generate:128.96s (401.8ms/T = 2.49T/s), Total:143.84s (2.23T/s)
[Context Shifting: Erased 721 tokens at position 780]
Processing Prompt [BLAS] (387 / 387 tokens)
Generating (394 / 400 tokens)
(EOS token triggered!)
CtxLimit: 15700/16384, Process:13.32s (34.4ms/T = 29.06T/s), Generate:158.31s (401.8ms/T = 2.49T/s), Total:171.62s (2.30T/s)
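To put those context-shifting numbers in perspective, here's some back-of-envelope arithmetic (mine, derived from the 70b q8 log above) comparing a follow-up turn with and without shifting:

```python
# Why context shifting matters at 16k: without it, every follow-up turn would reprocess
# the whole ~15k-token history instead of just the few hundred new tokens.
def turn_seconds(prompt_tokens, gen_tokens, process_ms, gen_ms=288.0):
    return (prompt_tokens * process_ms + gen_tokens * gen_ms) / 1000.0

# Full reprocess (first turn, or every turn without shifting): ~10.9 ms/T at large batch
no_shift = turn_seconds(14940, 400, process_ms=10.9)
# With shifting, only the ~400 new tokens get processed (~20.5 ms/T at small batch)
with_shift = turn_seconds(409, 400, process_ms=20.5)
print(f"~{no_shift:.0f}s vs ~{with_shift:.0f}s per follow-up turn")  # ~278s vs ~124s, matching the log
```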