Running Llama 3.1 405b Q6 and Command-A 111b Q8 on M3 Ultra Mac Studio

Below are benchmarks from running Llama 3.1 405b Q6 and Command-A 111b Q8 on an M3 Ultra Mac Studio (512GB) using KoboldCpp.

The 405b was so miserable to run that I didn't even try flash attention, and flash attention was completely broken with Command-A.

M3 Ultra Llama 3.1 405b Q6:

CtxLimit:12394/32768,
Amt:319/4000, Init:0.01s,
Process:535.61s (44.4ms/T = 22.54T/s),
Generate:255.33s (800.4ms/T = 1.25T/s),
Total:790.94s (0.40T/s)
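As a sanity check on how KoboldCpp reports these numbers, the per-token latency, generation rate, and overall rate can be re-derived from the token count and timings in the log above (a minimal sketch; the 0.40T/s "Total" figure appears to count only generated tokens over the full runtime):

```python
# Figures copied from the 405b Q6 log above.
gen_tokens = 319        # Amt: tokens generated
gen_seconds = 255.33    # Generate time
total_seconds = 790.94  # Process + Generate

ms_per_token = gen_seconds / gen_tokens * 1000   # ~800.4 ms/T
tokens_per_sec = gen_tokens / gen_seconds        # ~1.25 T/s
overall_rate = gen_tokens / total_seconds        # ~0.40 T/s

print(f"{ms_per_token:.1f} ms/T, {tokens_per_sec:.2f} T/s, {overall_rate:.2f} T/s overall")
```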

M3 Ultra Llama 3.1 405b q6 with Llama 3.2 3b spec decoding:

CtxLimit:12396/32768,
Amt:321/4000, Init:0.02s,
Process:543.07s (45.0ms/T = 22.23T/s),
Generate:209.67s (653.2ms/T = 1.53T/s),
Total:752.75s (0.43T/s)

M3 Ultra Command-A 111b Q8:

CtxLimit:13722/32768,
Amt:303/4000, Init:0.03s,
Process:161.94s (12.1ms/T = 82.86T/s),
Generate:93.65s (309.1ms/T = 3.24T/s),
Total:255.59s (1.19T/s)

M3 Ultra Command-A 111b Q8 with r7b spec decoding:

CtxLimit:13807/32768,
Amt:389/4000, Init:0.04s,
Process:177.33s (13.2ms/T = 75.67T/s),
Generate:88.36s (227.1ms/T = 4.40T/s),
Total:265.68s (1.46T/s)