NTK Scaling and Llama 2
I wonder if Llama 2 killed off the SuperHOT models, thanks to NTK scaling.
It used to be that the SuperHOTs handled context scaling the best, so I really liked them, especially after seeing someone recommend using NTK with them instead of linear scaling. Using the SuperHOT version of an older 2048-context Llama model, I ran 6144 context with alpha 3 / rope_base 46000 and it was super stable and worked amazingly, whereas the regular non-SuperHOT version of the same model really started to go off the rails once I pushed past 4096.
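For anyone curious, the NTK-aware approach scales RoPE's frequency base instead of compressing positions the way linear scaling does. Here's a minimal sketch of the commonly cited formula, new_base = base * alpha^(dim/(dim-2)), assuming Llama's default base of 10000 and a head dimension of 128; loaders round and tweak this differently, which is part of why it doesn't land exactly on the 46000 I used:

```python
# Sketch of NTK-aware RoPE base scaling (assumed formula:
# new_base = base * alpha ** (dim / (dim - 2)); exact handling varies by loader).

def ntk_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the scaled RoPE frequency base for a given NTK alpha."""
    return base * alpha ** (head_dim / (head_dim - 2))

if __name__ == "__main__":
    for alpha in (2, 3, 4):
        print(f"alpha {alpha}: rope_base ~ {ntk_rope_base(alpha):,.0f}")
```

As far as I can tell, the alpha knob in loaders that expose one is doing roughly this under the hood, just with implementation-specific tweaks.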
It's a shame that trick didn't come out sooner, but I'm guessing Llama 2 was the nail in the coffin anyway. Since SuperHOT was stretching 2048 to 8192, it's a lot less of a stretch to hit the same context from Llama 2's base of 4096; a standard NTK setting of alpha 2 / rope_base 26000 would likely do it.
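To make the "less of a stretch" point concrete: doubling Llama 2's 4096 gets you 8192, the same window SuperHOT was targeting from 2048. A sketch of how that might look with Hugging Face transformers, assuming a Llama 2 checkpoint and a transformers version that accepts the rope_scaling config option with {"type": "dynamic", "factor": ...} for NTK-style scaling:

```python
# Sketch: loading a Llama 2 model with dynamic NTK RoPE scaling via the
# rope_scaling config option (assumes a transformers version that supports it).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # roughly 4096 -> 8192
    device_map="auto",  # optional; needs accelerate installed
)
```

On a llama.cpp-style loader you'd skip the factor and just pass the raw base, something like --rope-freq-base 26000.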
Now if I could just figure out why CodeLlama's rope base is 1,000,000 lol