Ok, I admit it: SillyTavern is a great way to test models after all

So after seeing a lot of folks recommend SillyTavern as a good front end for APIs, I finally decided to give it a better try. I've mostly been using Oobabooga, and while I'd had ST installed for many months, I never put much time into understanding its features since it seemed more game oriented. However, I recently wanted to swap to KoboldCPP for speed, thanks to its Context Shifting, and needed a good front end... so I begrudgingly updated my old ST install and gave it another proper go.

Now that I've played with it, I realize it's an excellent tool for testing models quickly. What I do is grab a handful of character cards off the internet, stick them in a group chat, and have them debate each other, giving each character a specific viewpoint to defend.
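For anyone who hasn't opened a card file before: the stance can be a single line baked into the card's description field. Here's a minimal sketch using the classic TavernAI V1 fields that SillyTavern reads; the character, topic, and wording are made-up examples of mine, not from any real card:

```json
{
  "name": "Dr. Anti",
  "description": "A retired physicist who is firmly AGAINST building the proposed particle collider and never concedes the point, no matter how long the debate runs.",
  "personality": "stubborn, precise, dryly sarcastic",
  "scenario": "A televised panel debate about the proposed collider.",
  "first_mes": "Let me be clear up front: this collider is a waste of money.",
  "mes_example": ""
}
```

That one AGAINST sentence in the description is the "needle" the tests below look for.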

The goal of this is threefold:

  • First is a test of context. If I have the context set to 16k and that gets filled up, the specific viewpoints I've baked into the characters (like that AGAINST line above) become a kind of "needle in a haystack" test: a character's stance in the argument might be a single sentence buried somewhere in the middle of 16k of context. If each character still adheres to their stance, the model is handling long context well.
  • Second is a test of prompt template and settings. I can redo an argument over and over with various settings and templates to see if the characters still adhere (see the sketch after this list). Is the anti person staying anti? Is the pro person staying pro? Are the "centrists" drifting into pro or anti positions? Does that change with a different prompt template?
  • Third is a test of model coherence. If the model is mixing up characters even at low context, that's a big problem. Likewise, if the model has all the characters get along and agree with each other when they should be arguing, that's also a failure. This is very common; getting the model to stop endlessly patting itself on the back is something almost every Llama 2 merge I've tried has struggled with.
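To make the second bullet concrete, here's the same debate turn wrapped in two of the standard instruct formats ST ships presets for, Alpaca-style and ChatML (the prompt text itself is just a placeholder of mine):

```
### Instruction:
Dr. Anti, respond to the panel's question about the collider.

### Response:
```

versus

```
<|im_start|>user
Dr. Anti, respond to the panel's question about the collider.<|im_end|>
<|im_start|>assistant
```

Same card, same question; whether the anti character stays anti under each wrapper is exactly what that bullet is checking.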

I've been having fun testing models. I've been playing with Frankenmerging myself, and these tests have weeded out... well... all of the ones I merged lol. Oh well. But it's been a great "quick test" for this stuff and saved me from embarrassing myself by sharing them.

I do have to shout out, again, to /u/WolframRavenwolf for Miqu-1-120b. Once again this model has impressed me. Other models I've tried really struggled to keep the characters straight, but this one adheres faithfully to every character's position at 16k context and makes great arguments for both sides. It's perhaps not the most eloquent model, and the characters do sound a lot like each other, but in terms of factually handling each character's stance and viewpoints? 10/10. (I do wish Miqu had an actual license. It kills me seeing how good Miqu-based models are and not being able to do anything of actual use with them.)

(EDIT: I tried base miqu-1-70b q5, and it flunked this test: my anti character was happily agreeing with the pro character and no one was arguing. Big happy family all around lol. So Miqu-1-120b greatly outperformed miqu-1-70b here.)

But anyhow, I just wanted to throw out that this makes a great quick, autonomous test. You just start a group chat, make sure the character prompts have positions for an argument baked into them, ask a question, set auto mode, and come back in a little bit.