NTK Scaling and Llama 2
I wonder if Llama 2 killed off the SuperHOT models, thanks to NTK scaling.
It used to be that the SuperHOTs handled context scaling the best, so I really liked them, especially after seeing someone recommend using NTK with them instead of linear scaling. Using the SuperHOT version of an older 2048-context Llama model, I ran 6144 context with alpha 3 / rope_base 46000 and it was super stable and worked amazingly, whereas the regular non-SuperHOT version of the same model really started to go off the rails once I pushed past 4096.
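For anyone curious, the NTK-aware approach scales RoPE's frequency base instead of compressing positions the way linear scaling does. Here's a minimal sketch of the commonly cited formula, new_base = base * alpha^(dim/(dim-2)), assuming Llama's default base of 10000 and a head dimension of 128; loaders round and tweak this differently, which is part of why it doesn't land exactly on the 46000 I used:

```python
# Sketch of NTK-aware RoPE base scaling (assumed formula:
# new_base = base * alpha ** (dim / (dim - 2)); exact handling varies by loader).

def ntk_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the scaled RoPE frequency base for a given NTK alpha."""
    return base * alpha ** (head_dim / (head_dim - 2))

if __name__ == "__main__":
    for alpha in (2, 3, 4):
        print(f"alpha {alpha}: rope_base ~ {ntk_rope_base(alpha):,.0f}")
```

As far as I can tell, the alpha knob in loaders that expose one is doing roughly this under the hood, just with implementation-specific tweaks.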
It's a shame that trick didn't come out sooner, but I'm guessing Llama 2 was the nail in the coffin anyway. Since SuperHOT was stretching 2048 to 8192, it's a lot less of a stretch to hit the same context from Llama 2's base of 4096; a standard NTK setting of alpha 2 / rope_base 26000 would likely do it.
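To make the "less of a stretch" point concrete: doubling Llama 2's 4096 gets you 8192, the same window SuperHOT was targeting from 2048. A sketch of how that might look with Hugging Face transformers, assuming a Llama 2 checkpoint and a transformers version that accepts the rope_scaling config option with {"type": "dynamic", "factor": ...} for NTK-style scaling:

```python
# Sketch: loading a Llama 2 model with dynamic NTK RoPE scaling via the
# rope_scaling config option (assumes a transformers version that supports it).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # roughly 4096 -> 8192
    device_map="auto",  # optional; needs accelerate installed
)
```

On a llama.cpp-style loader you'd skip the factor and just pass the raw base, something like --rope-freq-base 26000.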
Now if I could just figure out why CodeLlama's rope base is 1,000,000 lol