A Little About Local Models

So, folks keep asking about running AI models locally, and I figured I'd just put down what I know in one place. Might help someone out there.

Short answer: For most machines, I'd try a GGML or GGUF build of any 7b model at q4 to start, just to see if it loads. Something like "nous-hermes-llama-2-7b.Q4_0.gguf" from TheBloke/Nous-Hermes-Llama-2-7B-GGUF on HuggingFace works pretty well. Pair it with a good frontend like Oobabooga and you're set.
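If you'd rather poke at the file directly instead of going through a frontend, here's a minimal sketch using the llama-cpp-python package - the path and settings are just placeholders for wherever you put your download:

```python
# Minimal sketch: load a GGUF model with llama-cpp-python and run one prompt.
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-llama-2-7b.Q4_0.gguf",  # wherever you saved the file
    n_ctx=2048,  # context window size
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```

A frontend like Oobabooga is doing basically this under the hood, just with a UI on top.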

Now, if you want to understand what you're looking at when you see these model names, here's the breakdown:

The "b" in model names tells you the parameter count - 3b, 7b, 13b, 30b, etc. Bigger "b" means bigger file size. A raw 7b model needs about 14GB of space. And yeah, the whole thing loads into VRAM to run. More parameters generally means smarter model. 3b models aren't that smart, while 70b models are pretty advanced.

The "q" means it's quantized - basically compressed. q8 cuts the size down to a little over half. So that 7b model that takes 14GB would be about 9GB when quantized to q8. There are different levels: q8, q6, q5, q4, etc. Each step down makes the model smaller and faster but also "dumber". A 7b_q4 will make more mistakes than the raw 7B but will be much smaller.

The format matters too - ggml/gguf/gptq/etc. tell you what the model runs on (GGUF is the newer format that replaced GGML, but you'll still see both around). GGML and GGUF run off the CPU, which is slower but uses normal RAM. GPTQ and others run entirely off the GPU, using VRAM.
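For contrast with the GGUF example earlier, here's roughly what loading a GPTQ model looks like through the Hugging Face transformers library. This assumes a reasonably recent transformers plus the optimum and auto-gptq packages, and the repo name is just an example - check HuggingFace for whichever GPTQ build you actually want:

```python
# Rough sketch: load a GPTQ model (GPU-only, lives in VRAM) via transformers.
# pip install transformers accelerate optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Nous-Hermes-Llama-2-7B-GPTQ"  # example repo name
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")  # weights go onto the GPU

inputs = tokenizer("The three primary colors are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```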

GGML and GGUF can be run with video card offloading, which I'd recommend - the model loads in regular RAM but some of its layers get pushed onto the GPU to speed things up. Frontends like Oobabooga have docs on how to set this up, though many frontends won't do it right out of the box, and GPT4All doesn't do video card offloading at all. The instructions can look scarier than they actually are.
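In llama-cpp-python terms the offloading really is just one extra parameter, and frontends like Oobabooga expose the same knob in their model settings. The layer count below is made up - you'd tune it to however much VRAM you actually have:

```python
# Same GGUF load as before, but with some layers offloaded to the GPU.
# Needs a llama-cpp-python build compiled with GPU support (CUDA, Metal, etc.).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-llama-2-7b.Q4_0.gguf",
    n_gpu_layers=35,  # number of layers to push into VRAM; -1 offloads all of them
    n_ctx=2048,
)
```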

The rest of the model name - like "nous", "hermes", "llama2" - is the "flavor" of the model: the base model it was built on and the fine-tune applied on top. This is mostly about personal preference and use case. With a generic frontend like Oobabooga, you can use most models without extra setup. Different models respond differently, and you'll figure out the differences as you experiment.

I think that covers the basics. Once you understand these parts of model names, picking what works for your setup gets a lot easier. Start with something like a 7b_q4 in GGUF format and go from there. You'll quickly learn what works and what doesn't for your particular machine and use cases.