Running Llama 3.1 405b Q6 and Command-A 111b Q8 on M3 Ultra Mac Studio

Below are benchmarks from running Llama 3.1 405b Q6 and Command-A 111b Q8 on an M3 Ultra Mac Studio (512GB) using KoboldCpp.

The 405b was so miserable to run that I didn't even try flash attention, and flash attention was completely broken with Command-A.

M3 Ultra Llama 3.1 405b Q6:

CtxLimit:12394/32768,
Amt:319/4000, Init:0.01s,
Process:535.61s (44.4ms/T = 22.54T/s),
Generate:255.33s (800.4ms/T = 1.25T/s),
Total:790.94s (0.40T/s)
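As a sanity check on how KoboldCpp reports these numbers, the per-token latency, generation rate, and overall rate can be re-derived from the token count and timings in the log above (a minimal sketch; the 0.40T/s "Total" figure appears to count only generated tokens over the full runtime):

```python
# Figures copied from the 405b Q6 log above.
gen_tokens = 319        # Amt: tokens generated
gen_seconds = 255.33    # Generate time
total_seconds = 790.94  # Process + Generate

ms_per_token = gen_seconds / gen_tokens * 1000   # ~800.4 ms/T
tokens_per_sec = gen_tokens / gen_seconds        # ~1.25 T/s
overall_rate = gen_tokens / total_seconds        # ~0.40 T/s

print(f"{ms_per_token:.1f} ms/T, {tokens_per_sec:.2f} T/s, {overall_rate:.2f} T/s overall")
```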

M3 Ultra Llama 3.1 405b q6 with Llama 3.2 3b spec decoding:

CtxLimit:12396/32768,
Amt:321/4000, Init:0.02s,
Process:543.07s (45.0ms/T = 22.23T/s),
Generate:209.67s (653.2ms/T = 1.53T/s),
Total:752.75s (0.43T/s)

M3 Ultra Command-A 111b Q8:

CtxLimit:13722/32768,
Amt:303/4000, Init:0.03s,
Process:161.94s (12.1ms/T = 82.86T/s),
Generate:93.65s (309.1ms/T = 3.24T/s),
Total:255.59s (1.19T/s)

M3 Ultra Command-A 111b Q8 with r7b spec decoding:

CtxLimit:13807/32768,
Amt:389/4000, Init:0.04s,
Process:177.33s (13.2ms/T = 75.67T/s),
Generate:88.36s (227.1ms/T = 4.40T/s),
Total:265.68s (1.46T/s)