Wilmer and Token Management
One of the big keys to running LLMs on a Mac is token management. That's what a lot of Wilmer is built around.
Wilmer started out because I wanted to make the most of Llama 2 finetunes, but eventually its workflows became a way for me to keep overall token counts down. Macs handle large prompts slowly, and the smaller the prompts, the easier that is to deal with.
For example, consider a really long conversation with an LLM. I was working with GLM 5 on my M3 Ultra to help me set up a new Linux box in the house. I know Mac and Windows well enough, but my last true foray into Linux was 15 years ago or more, so I needed help.
Eventually the conversation grew to roughly 300 messages. If I had been sending the whole thing each turn, it would have been at least 100,000 tokens. A standard sliding cache could keep it quick, but at the cost of losing the start of the conversation. When you're on a Mac, a 20k-token prompt is already in frustrating territory, so you don't want to send much more than that, which means losing four-fifths of the conversation.
You could rely solely on vector memory, but then you're playing with fire with the sliding cache, hoping a change too deep in the context doesn't invalidate it and force a full reprocess.
So with Wilmer, I've been focused on a handful of context management techniques. Some have been in it since early 2024, and some I'm adding in now.
- File memories are JSON files that tie summaries to chunks of messages. The summary prompts can be anything, so it depends on the conversation type. For the Linux conversation, I set it to capture what changes we made successfully: packages installed, configs edited, services started or stopped. The system generates these automatically every 6000 tokens or so, which keeps each chunk focused and digestible.
- Chat Summary is similar, but rolls everything into one running overview. I use this to capture the 100-mile-high view of where we're at - what the overall goal is, what phase of the project we're in, what big decisions we've made.
- Vector memories are where the LLM generates individual facts as the conversation progresses and stores them for semantic search. This is more nuanced detail about what's going on: specific commands that worked, error messages we encountered and how we fixed them, configuration values we settled on.
- Conversation condensing is the newer piece. I configured it to keep my most recent 7000 tokens as raw, untouched messages. Then it takes the next 7000 tokens after that and summarizes them with awareness of the current topic. So if we're troubleshooting a networking issue, it'll lean into preserving networking details. Everything beyond that gets rolled into a neutral summary that captures the broad strokes without topic bias. This lets me keep the immediate context sharp while still holding onto the shape of a long conversation.
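To make the file-memory idea concrete, here's a minimal sketch of summaries tied to chunks of messages. The field names, the rough four-characters-per-token heuristic, and the `summarize` callback are all my illustrative assumptions, not Wilmer's actual schema:

```python
# Sketch: file memories as summaries keyed to message chunks,
# regenerated roughly every 6,000 tokens of conversation.
CHUNK_TOKEN_BUDGET = 6000

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return len(text) // 4

def chunk_messages(messages: list[str]) -> list[list[str]]:
    """Group messages into chunks of about CHUNK_TOKEN_BUDGET tokens each."""
    chunks, current, used = [], [], 0
    for msg in messages:
        cost = estimate_tokens(msg)
        if current and used + cost > CHUNK_TOKEN_BUDGET:
            chunks.append(current)
            current, used = [], 0
        current.append(msg)
        used += cost
    if current:
        chunks.append(current)
    return chunks

def build_file_memories(messages: list[str], summarize) -> list[dict]:
    """summarize(chunk) stands in for a small-LLM call with a
    conversation-specific prompt (e.g. 'capture packages installed,
    configs edited, services started or stopped')."""
    memories, start = [], 0
    for chunk in chunk_messages(messages):
        memories.append({
            "first_message": start,
            "message_count": len(chunk),
            "summary": summarize(chunk),
        })
        start += len(chunk)
    return memories
```

The resulting list of dicts is what would get written out as the JSON memory file; the summary prompt is the part you'd tailor per conversation.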
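The vector-memory piece boils down to embedding facts as they're generated and retrieving them later by similarity. A tiny sketch of that shape, where `embed` stands in for any embedding model and nothing here reflects Wilmer's actual implementation:

```python
# Sketch: store individual facts with their embedding vectors,
# retrieve the closest ones by cosine similarity at query time.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    def __init__(self, embed):
        self.embed = embed      # callable: text -> vector
        self.store = []         # list of (vector, fact) pairs

    def add(self, fact: str) -> None:
        self.store.append((self.embed(fact), fact))

    def search(self, query: str, k: int = 3) -> list[str]:
        qv = self.embed(query)
        ranked = sorted(self.store, key=lambda p: cosine(qv, p[0]), reverse=True)
        return [fact for _, fact in ranked[:k]]
```

This is how "what port did we settle on?" at message 300 can surface a fact stored at message 40 without replaying the transcript.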
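The condensing tiers are easiest to see as a backwards walk over the conversation: fill the raw budget first, then the topic-aware budget, and let everything older spill into the neutral tier. The budgets match what I described above, but the function shape and names are a sketch of the idea, not Wilmer's code:

```python
# Sketch: split a conversation into three tiers for condensing.
#   raw     -> newest ~7000 tokens, sent untouched
#   topical -> next ~7000 tokens, summarized with topic awareness
#   neutral -> everything older, rolled into a neutral summary
RAW_BUDGET = 7000
TOPIC_BUDGET = 7000

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude ~4-chars-per-token heuristic

def split_tiers(messages: list[str]):
    """Walk from newest to oldest, filling the raw tier first, then the
    topic-aware tier; whatever remains falls into the neutral tier."""
    raw, topical, neutral = [], [], []
    used_raw = used_topic = 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used_raw + cost <= RAW_BUDGET:
            raw.append(msg)
            used_raw += cost
        elif used_topic + cost <= TOPIC_BUDGET:
            topical.append(msg)
            used_topic += cost
        else:
            neutral.append(msg)
    # Restore chronological order within each tier.
    return neutral[::-1], topical[::-1], raw[::-1]
```

From there, the topical tier gets a summary prompt seeded with the current topic (networking details stay networking details), while the neutral tier gets a generic one.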
On top of this, I give the LLMs persistent files they can read from and write to: my speech preferences, behaviors to avoid, recent events in my life, and a persona file that defines how the AI presents itself. A common problem for LLMs is losing their internal train of thought and having to re-derive their stance or goal on every turn. With these files, the AI can jot down notes between messages and pick up where it left off.
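The mechanics of those between-message notes are simple enough to sketch: write the model's notes to disk after each turn, prepend them to the next prompt. The file layout and prompt wiring here are my assumptions, not Wilmer's:

```python
# Sketch: a persistent notes file the assistant reads before each turn
# and can update after each turn.
from pathlib import Path

NOTES_PATH = Path("assistant_notes.txt")

def load_notes() -> str:
    return NOTES_PATH.read_text() if NOTES_PATH.exists() else ""

def save_notes(notes: str) -> None:
    NOTES_PATH.write_text(notes)

def build_prompt(user_message: str) -> str:
    notes = load_notes()
    preamble = f"Your notes from earlier turns:\n{notes}\n\n" if notes else ""
    return f"{preamble}User: {user_message}"
```

Because the notes live on disk rather than in the context window, they survive condensing, cache resets, and even restarting the whole stack.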
Separating out the image processor also lets me use a different vision model from the main thinkers, but more importantly it lets me cache previous vision responses. Once I send an image, the LLM doesn't have to reprocess it but can still answer questions about it. That's super helpful, and something I don't see many front-ends doing; in most of them, the model loses the context of an image after just a few messages.
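The caching trick is essentially memoization keyed on the image bytes. A minimal sketch, where `describe_image` stands in for whatever vision-model call you're making (the function name and cache shape are my assumptions):

```python
# Sketch: cache vision-model output so each image is processed once,
# keyed by a hash of the raw image bytes.
import hashlib

_vision_cache: dict[str, str] = {}

def cached_description(image_bytes: bytes, describe_image) -> str:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _vision_cache:
        # Only pay the (slow) vision-model cost on a cache miss.
        _vision_cache[key] = describe_image(image_bytes)
    return _vision_cache[key]
```

The cached text description then gets injected into later prompts, so the main model can keep answering questions about the image without the vision model ever seeing it again.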
All of this gives me the ability to have massive conversations, hundreds of messages long, while maintaining consistent knowledge, and all while sending only 15-20k tokens to the LLM in any given message. Overall I process more tokens than if I just left it all to a sliding cache, but in return I get an assistant that can answer a question at message 300 about something way back in the first 20 messages.
The real advantage is that I can use smaller models for most of the heavy lifting. During my Linux setup, what I really wanted was the final response from GLM 5. That's the model walking me through everything. But parsing through memories, updating summaries, deciding whether to pull from Wikipedia, condensing old conversation chunks? That gets pawned off to weaker models, sometimes down to the 4-billion-parameter range. They finish in no time at all. Then when GLM 5 kicks off, it's been handed everything it could hope for in terms of context, and it only has to work with 20k tokens or less.
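The routing itself can be as simple as a lookup table: maintenance work goes to a small model, and only the user-facing response goes to the big one. The task names and model labels below are placeholders, not Wilmer's configuration:

```python
# Sketch: route maintenance tasks to a small model, the final
# response to the flagship model.
MAINTENANCE_TASKS = {"update_memories", "update_summary", "condense_history",
                     "check_wikipedia"}

def route(task: str) -> str:
    """Return which model should handle a given workflow task."""
    return "small-4b" if task in MAINTENANCE_TASKS else "large-flagship"
```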