Real World Speeds on the Mac: Koboldcpp Context Shift Edition!

Previous post: Here are some real-world speeds for the Mac M2

Introduction

In my previous post, I showed the raw real-world numbers of what non-cached response times would look like for a Mac Studio M2 Ultra. My goal was to demonstrate how well the machine really handles models at full and large context. However, this wasn't a particularly fair view of the Mac, since very few people will be sending large context requests over and over without anything cached. Additionally, there are some great tools available to speed up inference, so those numbers represented more of a worst-case scenario.

Now, I offer a follow-up - this time using Koboldcpp with context shifting enabled to show a best-case scenario. Since the UI for Kobold isn't quite my cup of tea, and so many people use SillyTavern, I grabbed that as my front end. I filled up my clipboard and set off to bombard "Coding Sensei" with walls of text like it had never seen before.

This post is divided into two parts:

  1. Part 1: The Results
  2. Part 2: A quick tutorial on installing Koboldcpp on a Mac

Setup

  • Hardware: M2 Ultra Mac Studio with 192GB of RAM. I ran the sudo command to bump usable VRAM from 147GB to 170GB (see the sketch after this list).
  • Software: Koboldcpp backend with context shift enabled, SillyTavern front end.
  • Method: Bombarding Coding Sensei with walls of text, aiming for ~400 token responses from the AI to keep results consistent.
  • Settings: I cranked the temperature up to 5 to achieve consistent response lengths.
  • Note: My responses to the AI were short. If you write novels as responses, add a few seconds to each of these results. Prompt evaluation is fast enough that writing 400 tokens really isn't adding a lot of overhead. It's reading thousands of tokens plus the writing that takes the longest.
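
For reference, the usual way to bump that limit is a one-line sysctl call. Treat this as a sketch: the exact key name depends on your macOS version, and 170GB works out to roughly 174,080 MB:

sudo sysctl iogpu.wired_limit_mb=174080
# On older macOS versions the key was debug.iogpu.wired_limit_mb instead.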

Important Note: The first message of each test represents no cache (fresh from load), just like my previous post, so those numbers are similar. The next 2-3 messages use context shifting and will be much faster.


Part 1: The Results

TheProfessor 155b q8 @ 8k

Initial Request (No Cache):

  • Context: 7,914/8,192 tokens
  • Prompt Processing: 167.77s (22.3ms/T = 44.79T/s)
  • Generation: 158.95s (397.4ms/T = 2.52T/s)
  • Total: 326.72s (1.22T/s)

With Context Shifting:

  • Context: 7,856/8,192 tokens (Erased 475 tokens at position 818)
  • Prompt Processing: 8.66s (234.0ms/T = 4.27T/s)
  • Generation: 160.64s (401.6ms/T = 2.49T/s)
  • Total: 169.30s (2.36T/s)

With Context Shifting:

  • Context: 7,928/8,192 tokens (Erased 328 tokens at position 818)
  • Prompt Processing: 8.73s (242.4ms/T = 4.12T/s)
  • Generation: 160.53s (401.3ms/T = 2.49T/s)
  • Total: 169.26s (2.36T/s)

Miqu-1-120b q8 @ 32k

Initial Request (No Cache):

  • Context: 32,484/32,768 tokens
  • Prompt Processing: 778.50s (24.2ms/T = 41.39T/s)
  • Generation: 177.64s (670.3ms/T = 1.49T/s)
  • Total: 956.15s (0.28T/s)

With Context Shifting:

  • Context: 32,621/32,768 tokens (Erased 308 tokens at position 4,356)
  • Prompt Processing: 8.47s (184.2ms/T = 5.43T/s)
  • Generation: 270.96s (677.4ms/T = 1.48T/s)
  • Total: 279.43s (1.43T/s)

With Context Shifting:

  • Context: 32,397/32,768 tokens (Erased 495 tokens at position 4,364)
  • Prompt Processing: 7.79s (251.3ms/T = 3.98T/s)
  • Generation: 171.01s (678.6ms/T = 1.47T/s)
  • Total: 178.80s (1.41T/s)

With Context Shifting:

  • Context: 32,545/32,768 tokens (Erased 274 tokens at position 4,364)
  • Prompt Processing: 9.61s (100.1ms/T = 9.99T/s)
  • Generation: 222.12s (679.3ms/T = 1.47T/s)
  • Total: 231.73s (1.41T/s)

Miqu-1-120b q8 @ 16k

Initial Request (No Cache):

  • Context: 15,690/16,384 tokens
  • Prompt Processing: 292.33s (18.9ms/T = 52.82T/s)
  • Generation: 103.08s (415.6ms/T = 2.41T/s)
  • Total: 395.41s (0.63T/s)

With Context Shifting:

  • Context: 16,130/16,384 tokens
  • Prompt Processing: 7.51s (183.1ms/T = 5.46T/s)
  • Generation: 168.53s (421.3ms/T = 2.37T/s)
  • Total: 176.04s (2.27T/s)

With Context Shifting:

  • Context: 16,116/16,384 tokens (Erased 349 tokens at position 811)
  • Prompt Processing: 6.93s (216.5ms/T = 4.62T/s)
  • Generation: 160.45s (425.6ms/T = 2.35T/s)
  • Total: 167.38s (2.25T/s)

Miqu-1-120b q8 @ 4k

Initial Request (No Cache):

  • Context: 3,715/4,096 tokens
  • Prompt Processing: 60.47s (17.7ms/T = 56.56T/s)
  • Generation: 74.97s (254.1ms/T = 3.94T/s)
  • Total: 135.43s (2.18T/s)

With Context Shifting:

  • Context: 3,567/4,096 tokens (Erased 573 tokens at position 820)
  • Prompt Processing: 6.60s (254.0ms/T = 3.94T/s)
  • Generation: 102.83s (257.1ms/T = 3.89T/s)
  • Total: 109.43s (3.66T/s)

With Context Shifting:

  • Context: 3,810/4,096 tokens
  • Prompt Processing: 8.21s (65.2ms/T = 15.35T/s)
  • Generation: 59.73s (256.4ms/T = 3.90T/s)
  • Total: 67.94s (3.43T/s)

Miqu-1-70b q5_K_M @ 32k

Initial Request (No Cache):

  • Context: 32,600/32,768 tokens
  • Prompt Processing: 526.17s (16.3ms/T = 61.20T/s)
  • Generation: 152.02s (380.0ms/T = 2.63T/s)
  • Total: 678.19s (0.59T/s)

With Context Shifting:

  • Context: 32,619/32,768 tokens (Erased 367 tokens at position 4,361)
  • Prompt Processing: 2.93s (104.8ms/T = 9.55T/s)
  • Generation: 153.93s (384.8ms/T = 2.60T/s)
  • Total: 156.86s (2.55T/s)

With Context Shifting:

  • Context: 32,473/32,768 tokens (Erased 489 tokens at position 4,356)
  • Prompt Processing: 2.95s (117.9ms/T = 8.48T/s)
  • Generation: 122.64s (384.5ms/T = 2.60T/s)
  • Total: 125.59s (2.54T/s)

Miqu-1-70b q5_K_M @ 8k

Initial Request (No Cache):

  • Context: 7,893/8,192 tokens
  • Prompt Processing: 93.14s (12.4ms/T = 80.67T/s)
  • Generation: 65.07s (171.7ms/T = 5.82T/s)
  • Total: 158.21s (2.40T/s)

With Context Shifting:

  • Context: 7,709/8,192 tokens (Erased 475 tokens at position 818)
  • Prompt Processing: 2.71s (44.4ms/T = 22.50T/s)
  • Generation: 49.72s (173.8ms/T = 5.75T/s)
  • Total: 52.43s (5.46T/s)

With Context Shifting:

  • Context: 8,063/8,192 tokens (Erased 72 tokens at position 811)
  • Prompt Processing: 2.36s (76.0ms/T = 13.16T/s)
  • Generation: 69.14s (174.6ms/T = 5.73T/s)
  • Total: 71.50s (5.54T/s)

Nous-Capybara 34b q8 @ 65k (this completely broke context shifting)

Initial Request (No Cache):

  • Context: 61,781/65,536 tokens
  • Prompt Processing: 794.56s (12.9ms/T = 77.25T/s)
  • Generation: 170.37s (425.9ms/T = 2.35T/s)
  • Total: 964.93s (0.41T/s)

Second Request (Context Shifting Failed):

  • Context: 61,896/65,536 tokens
  • Prompt Processing: 799.03s (13.3ms/T = 75.21T/s)
  • Generation: 170.72s (426.8ms/T = 2.34T/s)
  • Total: 969.75s (0.41T/s)

Nous-Capybara 34b q8 @ 32k

Initial Request (No Cache):

  • Context: 30,646/32,768 tokens
  • Prompt Processing: 232.20s (7.7ms/T = 130.41T/s)
  • Generation: 86.04s (235.7ms/T = 4.24T/s)
  • Total: 318.24s (1.15T/s)

With Context Shifting:

  • Context: 30,462/32,768 tokens (Erased 354 tokens at position 4,038)
  • Prompt Processing: 1.78s (66.1ms/T = 15.13T/s)
  • Generation: 34.60s (237.0ms/T = 4.22T/s)
  • Total: 36.38s (4.01T/s)

With Context Shifting:

  • Context: 30,799/32,768 tokens (Erased 71 tokens at position 4,032)
  • Prompt Processing: 1.78s (74.2ms/T = 13.48T/s)
  • Generation: 92.29s (238.5ms/T = 4.19T/s)
  • Total: 94.07s (4.11T/s)

With Context Shifting:

  • Context: 30,570/32,768 tokens (Erased 431 tokens at position 4,038)
  • Prompt Processing: 1.80s (89.8ms/T = 11.13T/s)
  • Generation: 44.03s (238.0ms/T = 4.20T/s)
  • Total: 45.82s (4.04T/s)

Nous-Capybara 34b q8 @ 8k

Initial Request (No Cache):

  • Context: 5,469/8,192 tokens
  • Prompt Processing: 26.71s (5.0ms/T = 198.32T/s)
  • Generation: 16.08s (93.5ms/T = 10.70T/s)
  • Total: 42.79s (4.02T/s)

With Context Shifting:

  • Context: 5,745/8,192 tokens
  • Prompt Processing: 1.56s (40.0ms/T = 24.98T/s)
  • Generation: 22.75s (94.8ms/T = 10.55T/s)
  • Total: 24.32s (9.87T/s)

With Context Shifting:

  • Context: 6,160/8,192 tokens
  • Prompt Processing: 1.42s (74.7ms/T = 13.39T/s)
  • Generation: 38.70s (96.8ms/T = 10.33T/s)
  • Total: 40.12s (9.97T/s)

Llama 2 13b q8 @ 8k

Initial Request (No Cache):

  • Context: 6,435/8,192 tokens
  • Prompt Processing: 12.56s (2.1ms/T = 487.66T/s)
  • Generation: 13.94s (45.2ms/T = 22.10T/s)
  • Total: 26.50s (11.62T/s)

With Context Shifting:

  • Context: 6,742/8,192 tokens
  • Prompt Processing: 0.69s (22.9ms/T = 43.67T/s)
  • Generation: 12.82s (46.1ms/T = 21.69T/s)
  • Total: 13.51s (20.58T/s)

With Context Shifting:

  • Context: 7,161/8,192 tokens
  • Prompt Processing: 0.67s (31.7ms/T = 31.58T/s)
  • Generation: 18.86s (47.1ms/T = 21.21T/s)
  • Total: 19.52s (20.49T/s)

Mistral 7b q8 @ 32k

Initial Request (No Cache):

  • Context: 31,125/32,768 tokens
  • Prompt Processing: 59.73s (1.9ms/T = 514.38T/s)
  • Generation: 27.37s (68.4ms/T = 14.61T/s)
  • Total: 87.11s (4.59T/s)

With Context Shifting:

  • Context: 31,082/32,768 tokens (Erased 347 tokens at position 4,166)
  • Prompt Processing: 0.52s (25.9ms/T = 38.61T/s)
  • Generation: 23.68s (68.8ms/T = 14.53T/s)
  • Total: 24.19s (14.22T/s)

With Context Shifting:

  • Context: 31,036/32,768 tokens (Erased 467 tokens at position 4,161)
  • Prompt Processing: 0.52s (21.7ms/T = 46.15T/s)
  • Generation: 27.61s (69.0ms/T = 14.49T/s)
  • Total: 28.13s (14.22T/s)

Metal Configuration Details

For those wondering whether Metal was being used, here is the relevant loader output:

llm_load_tensors: offloading 180 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 181/181 layers to GPU
llm_load_tensors:        CPU buffer size =   265.64 MiB
llm_load_tensors:      Metal buffer size = 156336.93 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:32000.0).
llama_new_context_with_model: n_ctx      = 8272
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      Metal KV buffer size =  5816.25 MiB
llama_new_context_with_model: KV self size  = 5816.25 MiB, K (f16): 2908.12 MiB, V (f16): 2908.12 MiB
llama_new_context_with_model:        CPU input buffer size   =    68.36 MiB
llama_new_context_with_model:      Metal compute buffer size =  2228.32 MiB
llama_new_context_with_model:        CPU compute buffer size =    32.00 MiB

Part 2: Installing KoboldCpp on the Mac

Here is a step-by-step guide for installing Koboldcpp. Some of these steps I had already done before, so I'm writing them from memory. If I missed a step, please let me know.

Step 1: Install Python

Download and install Python (I use Python 3.11, not 3.12) from python.org/downloads.
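
To confirm the install and check which version you're running, do a quick check in Terminal:

python3 --version

It should print something like Python 3.11.x.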

Step 2: Download Koboldcpp

Go to the Koboldcpp GitHub repository; on the right side you will see a link under "Releases". As of this writing, the latest is koboldcpp-1.58. Download the zip file.
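
Alternatively, if you have git installed, cloning the repository gets you the same files and makes updating easier later; note the folder will be named koboldcpp rather than koboldcpp-1.58, so adjust the cd in Step 4 accordingly:

git clone https://github.com/LostRuins/koboldcpp.git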

Step 3: Extract the Files

Unzip the downloaded file somewhere convenient. I put mine in my "Home" directory.
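
If you prefer to extract from Terminal rather than Finder, something like this works (the zip filename below is an assumption based on the 1.58 release; use whatever your download is actually called):

unzip ~/Downloads/koboldcpp-1.58.zip -d ~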

Step 4: Navigate to the Directory

Open "Terminal" and use the command cd to navigate to koboldcpp:

cd /Users/MyUserName/koboldcpp-1.58

Step 5: Compile

Type the following command and hit enter. Wait for a while as it completes:

make LLAMA_METAL=1

Step 6: Install Requirements

Type the following command:

python3 -m pip install -r requirements.txt

Important: I ran into a frustrating issue on this step because I kept using the command python. Once I switched to python3, it worked. Most likely the plain python on my system pointed at a different interpreter than the one the requirements were being installed into.
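
If you hit the same problem, it helps to check where each command actually points and which interpreter pip is installing into (a generic sanity check, not a Kobold-specific step):

which python
which python3
python3 -m pip --version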

That's it! Koboldcpp is now installed.

Running Your Model

Here's an example command to run a model:

python3 koboldcpp.py --noblas --gpulayers 200 --threads 11 --blasthreads 11 --blasbatchsize 1024 --contextsize 32768 --model /Users/MyUserName/models/miqu-1-70b.q5_K_M.gguf --quiet

Command Explanation:

  • --noblas is for speed on the Mac. BLAS is apparently slow on it, per the Kobold docs, and this flag forces the use of Apple's "Accelerate" framework instead.
  • --gpulayers 200 means I don't have to think about gpulayers anymore. Going over the required number does nothing; it will just fill the maximum.
  • --threads 11: I have a 24-core processor, with 16 performance and 8 efficiency cores. Normally I'd use 16, but after reading online, I found things move a little faster with less than max. So I chose 11. Adjust based on your needs.
  • --blasthreads: I see no reason not to match --threads.
  • --blasbatchsize 1024: For those coming from Oobabooga, Kobold actually respects batch sizes, and I've found 1024 is the fastest. I didn't extensively test it, so try multiples of 256 up to 2048.
  • --contextsize: Set your desired context size.
  • --model: Path to your model file.
  • --quiet: Without this, it posts your entire prompt every time. This would have made testing difficult, so I used it.

This exposes an API on port 5001 and automatically enables "listen", so it broadcasts on the network.
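
As a quick sanity check that the API is up, you can send a small request from another Terminal window. This assumes the default port and the standard KoboldAI generate endpoint that Koboldcpp exposes; adjust the address, prompt, and length to taste:

curl -s http://localhost:5001/api/v1/generate -H "Content-Type: application/json" -d '{"prompt": "Hello, Coding Sensei.", "max_length": 60}'

If everything is working, you'll get back a small JSON object containing the generated text.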