WilmerAI

My Unorthodox Homelab Setup: Updated

So my old Reddit post about my "unorthodox setup" went down with the reddit ship, and figured it was time for an update anyway, so I'm bringing it back. My setup has gotten more complex than I originally planned, built out piecemeal over the past 2.5 years, but it's been a lot of fun, and has opened a lot of options for testing things out with Wilmer.

Some folks spend a lot of money on their cars or traveling or whatnot. I buy computers to develop on/with. If you asked me to justify the expense vs, say, an expensive API subscription? I couldn't do it. I could probably get 8 years of Gemini Pro's top level subscription out of what I've spent on this. It's just fun, and I've learned a lot thanks to it. I'm very fortunate to have had the option to do this.

Hardware

Lets start with the hardware. What I'm currently running:

A Refurbished M2 Ultra 192GB Mac Studio (disconnected from the net)
- Uses about 400W max power
- Running Qwen 235b a22b Instruct 2507 q4_K_M
- Running Qwen2.5 32b VL in KoboldCpp

A second refurbished M2 Ultra 192GB Mac Studio (also disconnected from the net)
- Uses about 400W max power
- Running GLM 4.5 Air q8_0
- Running Magistral 24b for VL capability
- This is the Shared AI box for the house; for myself, my wife, guests, etc. Has its own separate instances of Open WebUI, Wilmer, Offline Wikipedia Api, etc.
- Acts as the utility Wilmer "core"

An M3 Ultra 512GB Mac Studio (disconnected from the net)
- Also about 400W max power
- Running GLM 4.5 full in Llama.cpp server
- Running Qwen3 30b a3b Instruct 2507 in Llama.cpp server
- Running Qwen3 30b Coder Instruct 2507 in Llama.cpp server

A Refurbished MacBook Pro M2 Max 96GB (disconnected from the net)
- About 140W max power
- Running GPT-OSS-120b in Llama.cpp server
- This is still the Wilmer mobile "core"
  - This core is disabled when at home, except for SomeOddCodeBot, which swaps to a "at home" configuration. When Im on the road, I don't run GPT-OSS-120b. The road configuration is Qwen3-30b instruct and code, with a small vision llm.
- SomeOddCodeBot runs here, with two "modes": one for home and one for the road.

RTX 4090 Nvidia added into an older gaming machine (PC is itself is about 4 years old; DDR4 memory and old SSD drives).
- 800W PSU, but never draws more than 500W from the wall based on the battery backup its connected to
- Windows 10 Development machine; runs small test and agent models (shout out to Parker on LocalLlama for sending me down a vLLM rabbit hole on this lately lol)

Three Cheap Windows 11 Mini-PCs from Amazon (too small and weak to run LLMs)
- One is the home "core", and serves household Wilmers, Open WebUIs and SillyTaverns for internal models. Disconnected from the net.
  - This replaced the now retired Mac Mini.
  - Roland runs here
- One serves an Open WebUI instance connected to proprietary APIs (Mistral, Claude, OpenAI). On the net.
- One is a VPN box for web project development; was used for a YouTube video recently

Total Max Power Draw: (400 * 3) + 140 + 500 + (60 * 3) == 1920 Watts max draw combined

Idle power draw is likely somewhere in the area of 200-300W for all the machines combined

Software

On the software side:

I have three Wilmer "cores": one for my own use, one for shared use, and one for mobile.
I currently have around 20-25 separate Wilmer instances running on the home cores.
- Some are straight forward workflows for hitting a specific model, so that I have the option to use all the available models independently for different answers
- Some are wiki workflows
- Some are complex workflows that iterate a single model multiple times to confirm and validate results
- Some are more complex workflows that utilize multiple models across multiple machines in order to validate results, and also get better speed. For example- big model does work then little models validate.
I'm running 6 Open WebUI instances in the house.
- Four are for my personal development
- One with accounts for myself and my wife as a kind of offline household GPT.
- One is for proprietary models connected to the internet.
I've also got 3 SillyTavern instances
- One for RolandAI
- One for SomeOddCodeBot
- One is currently down for maintenance, but it's for an assistant I built for my wife (the bot is based on an older but more stable version of Roland).

The vast majority of my configurations use the models with reasoning disabled. I'm really not a fan of reasoning in general; I've got workflows for that. I'd rather tell it exactly how to think through a problem.

In terms of Roland: it's currently powered by all of the models. I call it an assistant, but it's more like an abuse of decision trees; ton of branching nested workflows, each one hitting different LLMs and performing different steps to work through problems. Early versions of Roland were based mostly on Wilmer's central semantic router, but it's gone way beyond that. Now it uses dozens of branching routes that lead to more routes and those lead to more routes, etc.

Funny thing is: before I started working on Wilmer, I used to really dislike those "chicken soup for the soul" style posts you see on like Medium and LinkedIn with "5-steps on how to think logically about this type of problem!", but now I find them super valuable for setting up these bots. The whole reason I like Workflows over something like pure agents is because I can hold an LLM's hand while it solves a problem, rather than hoping it figures it out itself. Roland has a lot of these kinds of things, with the goal of making the ultimate rubber duck to bounce ideas off of, and to enforce thinking through problems in a way as close to me as I can achieve with LLMs. Eventually I want to use Roland for more complex tasks that I would otherwise handle myself.

SomeOddCodeBot is a lot more simple, built kind of like the example user assistant with vector memory but with a few more routes. It's really just a quick rubber duck to make sure I didn't do anything egregiously dumb or not think through something.

Eventually I'll write a more thorough writeup on some of what's gone into Roland, but I'm just never happy with it enough to want to say "This is what I'm showing off". With that said, I do think the current iteration of it is getting much closer.

Anyhow, that's the setup. I'll update again at some point in the future when there's more. It's come a LONG way over the years, and I'm not done yet. The day an M5 Ultra with matmul drops, I'll probably sell a bunch of it and start over lol.

My Unorthodox Homelab Setup: Updated

Hardware

Software

Read more

I haven't disappeared...

Understanding MoE Offloading

My Next Steps with Wilmer

Microsoft's New User Role Model