A Little About Local Models

So, folks keep asking about running AI models locally, and I figured I'd just put down what I know in one place. Might help someone out there.

Short answer: For most machines, I'd try a GGML or GGUF build of any 7b model at q4 to start, just to see if it loads. Something like "nous-hermes-llama-2-7b.Q4_0.gguf" from TheBloke/Nous-Hermes-Llama-2-7B-GGUF on HuggingFace works pretty well. Pair it with a good frontend like Oobabooga and you're set.
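If you'd rather poke at the file directly instead of going through a frontend, here's a minimal sketch using the llama-cpp-python package - the path and settings are just placeholders for wherever you put your download:

```python
# Minimal sketch: load a GGUF model with llama-cpp-python and run one prompt.
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-llama-2-7b.Q4_0.gguf",  # wherever you saved the file
    n_ctx=2048,  # context window size
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```

A frontend like Oobabooga is doing basically this under the hood, just with a UI on top.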

Now, if you want to understand what you're looking at when you see these model names, here's the breakdown:

The "b" in model names tells you the parameter count - 3b, 7b, 13b, 30b, etc. Bigger "b" means bigger file size. A raw 7b model needs about 14GB of space. And yeah, the whole thing loads into VRAM to run. More parameters generally means smarter model. 3b models aren't that smart, while 70b models are pretty advanced.

The "q" means it's quantized - basically compressed. q8 cuts the size down to a little over half. So that 7b model that takes 14GB would be about 9GB when quantized to q8. There are different levels: q8, q6, q5, q4, etc. Each step down makes the model smaller and faster but also "dumber". A 7b_q4 will make more mistakes than the raw 7B but will be much smaller.

The format matters too - ggml/gguf/gptq/etc. tell you what the model runs on (GGUF is the newer format that replaced GGML, but you'll still see both around). GGML and GGUF run off the CPU, which is slower but uses normal RAM. GPTQ and others run entirely off the GPU, using VRAM.
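For contrast with the GGUF example earlier, here's roughly what loading a GPTQ model looks like through the Hugging Face transformers library. This assumes a reasonably recent transformers plus the optimum and auto-gptq packages, and the repo name is just an example - check HuggingFace for whichever GPTQ build you actually want:

```python
# Rough sketch: load a GPTQ model (GPU-only, lives in VRAM) via transformers.
# pip install transformers accelerate optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Nous-Hermes-Llama-2-7B-GPTQ"  # example repo name
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")  # weights go onto the GPU

inputs = tokenizer("The three primary colors are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```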

GGML and GGUF can be run with video card offloading, which I'd recommend - the model loads in regular RAM but some of its layers get pushed onto the GPU to speed things up. Frontends like Oobabooga have docs on how to set this up, though many frontends won't do it right out of the box, and GPT4All doesn't do video card offloading at all. The instructions can look scarier than they actually are.
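In llama-cpp-python terms the offloading really is just one extra parameter, and frontends like Oobabooga expose the same knob in their model settings. The layer count below is made up - you'd tune it to however much VRAM you actually have:

```python
# Same GGUF load as before, but with some layers offloaded to the GPU.
# Needs a llama-cpp-python build compiled with GPU support (CUDA, Metal, etc.).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-llama-2-7b.Q4_0.gguf",
    n_gpu_layers=35,  # number of layers to push into VRAM; -1 offloads all of them
    n_ctx=2048,
)
```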

The rest of the model name - like "nous", "hermes", "llama2" - is the "flavor" of the model: the base model it was built on and the fine-tune applied on top. This is mostly about personal preference and use case. With a generic frontend like Oobabooga, you can use most models without extra setup. Different models respond differently, and you'll figure out the differences as you experiment.

I think that covers the basics. Once you understand these parts of model names, picking what works for your setup gets a lot easier. Start with something like a 7b_q4 in GGUF format and go from there. You'll quickly learn what works and what doesn't for your particular machine and use cases.