Understanding MoE Offloading
I was trying to answer someone's question about how Llama.cpp handles offloading with Mixture of Experts models on a regular gaming PC with a 24GB GPU, and ended up spending a few hours in a deep dive.
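For context, the usual trick on a 24GB card is to keep attention and shared weights in VRAM while pinning the huge MoE expert tensors to system RAM. A minimal sketch of the flags involved, assuming a recent llama.cpp build with tensor overrides; the model path and the tensor-name regex are illustrative (check your GGUF's actual tensor names before relying on the pattern):

```shell
# Ask for all layers on GPU (-ngl 99), then override the MoE expert
# tensors (ffn_*_exps) back to CPU so only the dense parts use VRAM.
# Path and regex are examples, not a known-good config.
./llama-server \
  -m ./models/my-moe-model-q4_k_m.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps.=CPU" \
  -c 16384
```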
So my old Reddit post about my "unorthodox setup" went down with the reddit ship, and figured it was time for an update anyway, so I'm bringing it back. My setup has gotten more complex than I originally planned, built out piecemeal over the past 2.
RAG is really 90% a software development problem and 10% an AI problem. People overcomplicate it a lot on the AI side, but it's a $5 term for a $0.05 concept: give the LLM the answer before it responds to you. On its face, that's simple
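A toy sketch of that "give it the answer first" idea. The retriever here is naive keyword overlap standing in for embeddings, and the function names are mine rather than from any particular framework; the point is just the shape of the pipeline:

```python
def retrieve(question, docs, k=2):
    # Naive keyword-overlap retriever; real systems use embedding
    # similarity, but the pipeline shape is identical.
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, docs):
    # The whole "AI side" of RAG: paste the retrieved answer into
    # the prompt before the model ever responds.
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Everything hard about RAG lives in `retrieve` (chunking, indexing, ranking), which is why it's mostly a software problem.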
M3 Ultra Mac Studio 512GB Speeds
Qwen3 235b a22b Instruct Q8 in Llama.cpp server (~15k tokens)
prompt eval time: 4.60 ms per token, 217.29 tokens per second
eval time: 67.59 ms per token, 14.80 tokens per second
total time: 146863.82 ms / 15763 tokens (~5k
When I first started Wilmer, it was for a very specific reason: I wanted a semantic router, and one didn't yet exist. The routers that were available were all specifically designed to take the last message, categorize that, and route you that way. I needed more, though; what
So this looks like it could actually be a really fun model: https://huggingface.co/microsoft/UserLM-8b I like these little specific-purpose LLMs the most because they open up some neat doors. They likely made this to act as the user-proxy in autogen, and they point out on their
After 3 months, /u/reddit finally messaged me to tell me the account was permanently banned. However, the section that should contain the reason for the ban is empty. It just says: "Your account has been permanently banned for breaking the rules. This account has been permanently closed." To continue
Every weekend for a while I've put out a release of Wilmer on Sunday; generally a few features I was able to knock out on Saturday and test on Sunday. Almost always using either some combination of local models with Wilmer via Open WebUI, or using Gemini 2.
Someone asked me to run the mxfp4 gguf vs q8, so I figured I'd post the results here too for anyone to see. As expected, mxfp4 comes out to a little over half the size of the q8, and the speed is just a bit faster. I expect
Anyone who knows of me knows that I don't like using coding agents. I have nothing particularly against them; I just don't prefer them. I like the quality and control of workflows in a direct chat window. You can see as much in my
One thing I've always wanted to do is have Wilmer workflows call themselves, so I can create a form of recursion within the workflows. This allows for a sort of semi-agentic behavior: repeated iterations on a problem with some breakout criteria. Now that may sound like an agent,
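The iterate-until-breakout idea can be sketched in a few lines. This is not Wilmer's actual API; `improve` and `is_good_enough` are hypothetical stand-ins (in a real workflow they might be an LLM refinement pass and an LLM judge), with a toy numeric task so the loop is visible:

```python
MAX_STEPS = 5  # hard cap so recursion always terminates

def improve(value):
    # Stand-in for one pass through a workflow node.
    return value * 2

def is_good_enough(value):
    # Breakout criterion — the thing that stops the recursion early.
    return value >= 100

def run_workflow(task, step=0):
    """A workflow node that re-enters itself until the breakout
    criterion or the step cap is hit: semi-agentic iteration."""
    result = improve(task)
    if is_good_enough(result) or step >= MAX_STEPS:
        return result
    return run_workflow(result, step + 1)
```

The step cap matters as much as the breakout criterion: without it, a judge that never says "good enough" would recurse forever.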
A few days ago Stanford dumped a whole pile of AI/ML lectures up on their youtube. They're a pretty good watch if you get bored and want to dive more into this stuff. Stanford OnlineYou can gain access to a world of education through Stanford Online, the
Don't put off unit tests. When I first started building Wilmer, I barely knew any Python, and of course I didn't have Wilmer to help me build it lol. So the early code was nothing shy of a disaster; coming from a C# background, I first
Somehow I've made it this far in life without ever actually using the site. But while I wait for one of my tickets with Reddit to finally reach a human so that I can get my account back, outside of Discord it's one of the few places I can