Slimming Down the Homelab Software Footprint

So my homelab setup post from a while back is already outdated. Not so much on the hardware side; it's the software that has consolidated dramatically.

The original setup had somewhere around 20 to 30 separate WilmerAI instances running across my network. Each one was configured for a specific purpose: coding assistance, general chat, RAG workflows, reasoning-heavy tasks, fast responses, and so on. Each instance pointed at one of my three main inference machines (the M2 Ultras and M3 Ultra). If I wanted a different use case, I spun up a different Wilmer instance and pointed it at the appropriate models on the appropriate machine.

This worked, but it was wasteful. Wilmer is lightweight at around 150 megabytes per instance, but multiply that by 25 or 30 instances and you're burning roughly 4 GB of memory. More importantly, it was fragile. If I fired off two workflow requests that both targeted the same Mac, they could hit the LLM simultaneously and either slow the machine down massively or crash it entirely. Apple Silicon doesn't handle parallel LLM inference well at all, so I had to tiptoe around my own setup, mentally tracking which workflows were in use before triggering another one.

Two changes have collapsed this down to something far more manageable.

The first is actually a llama.cpp change: the llama.cpp server recently added a router mode (think llama-swap), which lets a single instance manage multiple models. You start the server without specifying a model, point it at a directory of GGUF files, and then specify the model in each API request. The server handles loading, unloading, and LRU eviction automatically. For my use case, I now run two llama.cpp instances per physical machine: one for a large model (the responders) and one for a small model (the workers). Both stay loaded and pinned with mlock so there is no cold-start penalty. The model field in the request tells llama.cpp which one to use. That took me from an average of five llama.cpp instances per machine down to two.
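As a sketch of how that routing works: the only thing that changes between hitting the responder and hitting the worker is the model field in an otherwise ordinary OpenAI-style request. The model names below are hypothetical placeholders, not my actual GGUF files.

```python
# Hypothetical model names; the real ones match GGUF files in the
# directory the llama.cpp router was pointed at.
LARGE_MODEL = "big-responder-q8"   # the "responder" instance's model
SMALL_MODEL = "small-worker-q4"    # the "worker" instance's model

def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload; the `model` field
    is what tells the router which GGUF handles this request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# A workflow node that needs heavy reasoning targets the responder...
heavy = build_request(LARGE_MODEL, "Summarize this design doc.")
# ...while a quick classification step targets the worker.
quick = build_request(SMALL_MODEL, "Is this a coding question? yes/no")

print(heavy["model"], quick["model"])
```

Both payloads would be POSTed to the same server port; the router picks the loaded model by name, evicting on an LRU basis only if something doesn't fit.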

Running two llama.cpp instances also lets me balance the memory. I make sure my largest responder model leaves enough headroom for my largest worker model; if that combination can load side by side, I'm golden. With the Mac's memory caching, swapping models around as needed is super quick.
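The headroom check is just arithmetic. A rough sketch of it, where every gigabyte figure is a made-up placeholder rather than my actual machine or model sizes:

```python
# All numbers are hypothetical examples, not my real hardware.
TOTAL_RAM_GB = 192          # e.g. a high-memory Mac Studio
OS_RESERVE_GB = 16          # leave room for macOS and everything else

def fits_side_by_side(responder_gb: float, worker_gb: float,
                      total_gb: float = TOTAL_RAM_GB,
                      reserve_gb: float = OS_RESERVE_GB) -> bool:
    """True if the biggest responder and biggest worker can both stay
    loaded (and mlock-pinned) at the same time."""
    return responder_gb + worker_gb + reserve_gb <= total_gb

# A 120 GB responder plus a 40 GB worker fits under this budget...
print(fits_side_by_side(120, 40))
# ...but a 160 GB responder with the same worker would not.
print(fits_side_by_side(160, 40))
```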

The second big change is on the Wilmer side: specifically, the multi-user support I just finished building into Wilmer.

Instead of running a separate Wilmer process for each workflow, I now run a single Wilmer instance per physical machine with multiple users configured via the --User flag. Each "user" is really just a configuration profile: a set of endpoints, presets, memory settings, and workflow folders. The front-end selects which configuration to use by setting the model field to something like chris-openwebui-m3:coding or chris-openwebui-m3:general. Wilmer parses that prefix, loads the appropriate user config, and runs the shared workflow under that configuration.
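That prefix parsing amounts to splitting on the first colon. A minimal sketch that mirrors the examples above, though it is not Wilmer's actual code:

```python
def parse_model_field(model_field: str) -> tuple[str, str]:
    """Split a front-end model string like 'chris-openwebui-m3:coding'
    into (user_profile, workflow). The user profile selects endpoints,
    presets, and memory settings; the workflow names what to run."""
    user, sep, workflow = model_field.partition(":")
    if not sep:
        raise ValueError(f"expected 'user:workflow', got {model_field!r}")
    return user, workflow

print(parse_model_field("chris-openwebui-m3:coding"))
# ('chris-openwebui-m3', 'coding')
```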

The shared workflows are also a new feature. They expose workflow folders through the /v1/models and /api/tags endpoints, so frontends like Open WebUI just see them as models in a dropdown. Selecting one tells Wilmer which workflow to run.
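Conceptually, that endpoint just reshapes a list of workflow folders into the model-list format frontends already understand. A rough sketch assuming the standard OpenAI `/v1/models` response shape; Wilmer's actual implementation may differ:

```python
import time

def list_models(workflow_names: list[str]) -> dict:
    """Present workflow folders as 'models' in an OpenAI-style
    /v1/models response, so a frontend dropdown can select one."""
    now = int(time.time())
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "model", "created": now, "owned_by": "wilmer"}
            for name in workflow_names
        ],
    }

models = list_models(["openwebui-coding", "openwebui-general", "openwebui-rag"])
print([m["id"] for m in models["data"]])
```

From Open WebUI's point of view there is nothing special going on: it requests the model list, renders the ids in its dropdown, and sends the selected id back in the model field.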

In multi-user mode, the username prefix determines which user's endpoints and settings get used. So bob:openwebui-coding runs the same workflow as alice:openwebui-coding (assuming both are using shared workflows), but each hits their own configured LLM backends and presets.

The result is that my M3 Ultra now has a single Wilmer instance pointed to it, serving about a dozen different shared workflows, plus Roland and a Wikipedia researcher. The M2 Ultras are set up similarly. This cleaned up a LOT of memory on the Mac mini.

Concurrency limiting is the last big item. The --concurrency flag (defaulting to 1) queues incoming requests so only one hits the LLM at a time. I can now fire off multiple requests to different workflows on the same machine without worrying about crashing anything. Wilmer queues them and processes them sequentially, meaning I no longer have to keep track of what's hitting what.
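The effect is like a single-slot semaphore sitting in front of the LLM: any request arriving while another is in flight just waits its turn. A toy sketch of the idea, not Wilmer's actual queueing code:

```python
import threading

class SequentialGate:
    """Minimal sketch of a --concurrency style limit: requests from any
    workflow block until the in-flight slot frees up, so at most
    `concurrency` requests ever reach the LLM at once."""
    def __init__(self, concurrency: int = 1):
        self._slots = threading.Semaphore(concurrency)

    def run(self, call_llm, *args):
        with self._slots:          # queue here if the slot is taken
            return call_llm(*args)

gate = SequentialGate(concurrency=1)
results = []

def fake_llm(tag):
    results.append(tag)  # stands in for a real inference call
    return tag

# Three "workflows" fire at once; the gate serializes them.
threads = [threading.Thread(target=gate.run, args=(fake_llm, i)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # all three completed, one at a time
```

With the slot count at 1, the order the requests complete in depends on scheduling, but only one ever touches the backend at a time, which is exactly the property Apple Silicon needs.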

I still have separate instances for my mobile setup on the MacBook Pro. That one runs independently when I am on the road.

This is all something I've meant to do forever, along with the new memory features (like the memory condenser I mentioned in an earlier post). It's a little headache I've put up with for years, because scoping individual users was so challenging. But after the massive refactor I did in 2025, I could finally move almost all of the workflow- and user-related global variables into the new execution context, which let me ensure there was no bleed or crossover between users in multi-user setups.

Up until now, Wilmer was absolutely built for one person running it on their own machine. Now it's finally in a state where a single instance can properly handle multiple people at once.

The multi-user and concurrency features are not released yet. Shared workflows shipped earlier this year; the rest is coming in the next update.

I know deployments have slowed down a lot on Wilmer lately, but I haven't given up on it; it's just that it's in a spot where I can do some of the other projects I always wanted to, so I've kicked those off as well. Now my precious free time is split like 5 ways lol.