A Quick-ish Rundown of LLM Basics
Over the past few days, I've realized that there are a lot of folks out there using LLMs who haven't had an opportunity to dig, even a little, into the basics of how LLMs really work. And I guess that makes sense; for the most part, the average person doesn't have a lot of reason to know this. But if you're going to be a power user, there are things that would really help you to understand.
Below are the most basic basics. I'm not covering everything, just some stuff that, once you get it, makes the rest start to make sense as well. Hopefully it helps someone out there.
Tokens
When you write something to an LLM, it doesn't break your text down character by character; it breaks it into groups of characters called "tokens". Every LLM has its own tokenizer, so different models don't all choose the same tokens.
Here's a real world example of what tokenization might look like using Qwen3.6 27b's tokenizer: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/tokenizer.json. If you open that file, you'll see the full list of tokens that Qwen3.6 27b utilizes.
As for how tokens work... here's an example:
"This is a token"
- That's 15 characters
'This' 'Ġis' 'Ġa' 'Ġtoken'
- That's 4 tokens. You'll notice 'Ġ' at the start of three of them; that's what
GPT-2-style tokenizers (GPT-2/GPT-3/GPT-4 and many others) use to represent the space in front of a word
Each token maps to a number (its token ID), which the LLM then uses to do matrix math to determine the right output. If we go back to the link I gave you above, you can see the following:
This == 1919
Ġis == 369
Ġa == 264
Ġtoken == 3817
So Qwen3.6 27b would see your sentence as (1919, 369, 264, 3817). It then does matrix math and other cool pattern-y stuff to determine the best tokens to respond to you with.
So remember this when you hear that an LLM has a context window of 1,000,000 tokens: it's counting those things. Sometimes whole words are tokens, sometimes not; don't just assume every word is a token. Tokenizers are built around the most commonly used character sequences: "This", "is", and "a" are all very common in English, and "token" is very common in text about LLMs.
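To make the idea concrete, here's a toy sketch of tokenization in Python. The four-entry vocab is lifted straight from the example above and is obviously nowhere near a real vocabulary, and real tokenizers use learned BPE merges rather than this simple greedy longest-match; this just shows the text-to-IDs shape of the thing.

```python
# Toy greedy tokenizer sketch (NOT a real BPE tokenizer; just the core idea):
# match the longest known token at each position, then look up its ID.

VOCAB = {"This": 1919, "Ġis": 369, "Ġa": 264, "Ġtoken": 3817}

def tokenize(text):
    # GPT-2-style tokenizers mark a leading space with 'Ġ'
    text = text.replace(" ", "Ġ")
    tokens = []
    while text:
        # take the longest vocab entry that prefixes the remaining text
        # (this toy assumes every piece of the input is in the vocab)
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

tokens = tokenize("This is a token")
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['This', 'Ġis', 'Ġa', 'Ġtoken']
print(ids)     # [1919, 369, 264, 3817]
```

Fifteen characters in, four token IDs out; those four numbers are all the model ever sees.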
Context Windows
The way I usually describe context windows is to imagine the full Song of Ice and Fire book series printed out on one really long parchment, and you have a piece of cardboard with a window cut in it that you can read text through. All you know is whatever's currently in that window. If someone asks you about something outside the window? Tough luck, you don't know it.
Now, the obvious thought is "well, just make the window bigger". The problem is that if you cut the window too big, you have a harder time finding any specific thing in there, and you start mixing details up. You've learned to read a certain amount within that window, and pushing past that doesn't go great. If the full book were the length of a parking lot, and someone asked you for details that could be anywhere in that whole parking lot's worth of text... well, good luck.
That's pretty much how it works with LLMs. You'll see models advertise huge context windows like 1,000,000 tokens, but the real-world practical use of that is a lot smaller than the marketing implies. The bigger you stuff that window, the worse the model gets at pinpointing specific information inside it. There's a whole pile of benchmarks (needle-in-a-haystack tests, NoLiMa, RULER, etc) showing accuracy dropping as the context fills up. So a 200k token context window is not an invitation to dump 200k tokens in there and expect great results. You'll generally get a much better answer giving the model 8k of really relevant tokens than 200k of "everything I have on the topic".
To get a better visualization, check this benchmark out: https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87
Scroll down to the results section and you'll see a table; the numbers in there represent how well the model pulls the right info out based on the context size it was fed. You can see that some models, like GPT-5.2 or Opus 4.6, did great all the way up to 120k (except 5.2 Pro, for some reason...). But look at something like MiniMax 2.5, for example: by the time you hit 60k tokens, you have less than a 50% chance of getting all the right info you asked for.
This is a struggle a lot of us running local models deal with, and it usually means you want to account for it with good wrapper software or middleware.
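One of the simplest things that middleware does is keep your chat history inside a token budget. Here's a minimal sketch; the function names and the rough 4-characters-per-token estimate are made up for illustration (real apps count with the model's actual tokenizer):

```python
# Rough sketch of how a chat app keeps history inside a context window:
# keep the newest messages that fit the budget; older ones fall out.

def estimate_tokens(text):
    # crude heuristic: ~4 characters per token in English text
    return max(1, len(text) // 4)

def fit_to_window(messages, budget_tokens):
    kept = []
    used = 0
    for msg in reversed(messages):   # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget_tokens:
            break                    # everything older is "forgotten"
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order

history = ["msg %d: %s" % (i, "x" * 100) for i in range(50)]
window = fit_to_window(history, budget_tokens=200)
print(len(window), "of", len(history), "messages still visible")
```

That's the cardboard window in code: the newest text is in view, and whatever doesn't fit simply doesn't exist as far as the model is concerned.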
Model Sizes (ie- parameters)
When we talk about models, we size them by the number of parameters they have. 1M means a 1 million parameter model. That's itty bitty. 1b is 1 billion parameters- also itty bitty. Many modern models release at really huge sizes, like 397b up to 1T (1 trillion parameters).
The easiest way to imagine parameters is as data points that can each contribute to several pieces of knowledge at once. One parameter doesn't equate to a single fact like "When did the first Ford car release?"; that knowledge is spread across many parameters, and each parameter plays a part in many different facts.
Models are generally created in BF16 format to start with. Size-wise, BF16 equates to about 2GB per 1b, so a 20b model would be 40GB. If you "quantize" the model (easiest way to think of it is 'compressing' the model) to 8bpw, or ~q8_0, that becomes about 1GB per 1b. If you go further to 4bpw, or ~q4_0, you get down to about 0.5GB per 1b. That's how we fit big models on smaller hardware.
As you can imagine, the more you quantize, the more mistakes the model will likely make.
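The size math above is simple enough to write down. This is just the back-of-the-envelope version using the approximate bytes-per-parameter figures from the previous paragraph; real GGUF files carry a bit of extra overhead on top of these numbers:

```python
# Back-of-the-envelope model size at different quantization levels,
# using the ~2 bytes/param (BF16), ~1 (q8_0), ~0.5 (q4_0) figures above.

BYTES_PER_PARAM = {"bf16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def model_size_gb(params_billions, quant):
    # billions of params * bytes per param = size in GB (roughly)
    return params_billions * BYTES_PER_PARAM[quant]

for quant in ("bf16", "q8_0", "q4_0"):
    print("20b at %s: ~%.0f GB" % (quant, model_size_gb(20, quant)))
# 20b at bf16: ~40 GB
# 20b at q8_0: ~20 GB
# 20b at q4_0: ~10 GB
```

Handy when you're eyeballing whether a model fits in your VRAM before downloading it.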
Open Weight Models
These are models that you can download and run yourself. There are a few ways to do it, and here are some examples:
- Raw transformers - the original format the model releases in, loaded through Hugging Face's transformers library
- GGUF - This is a model that has been converted to run in llama.cpp
- MLX - This is converted to run in Apple's MLX
Many applications, like Ollama or LM Studio, wrap some of these and then have their own repositories to pull models from. For best speed and the fastest updates for model support, you generally want to avoid that. You can find all models here: https://huggingface.co.
Mixture of Experts (ie- MoE)
This section is only really relevant to Open Weight models, so you can skip this if you never plan to host your own.
Parameter count doesn't just affect knowledge, it also affects speed. The bigger the model, the more matrix math the computer has to do per token. So a 70b model running at the same quantization on the same hardware as a 7b is going to be a whole lot slower; you're doing roughly 10x the math per token. That's also why video cards handle LLMs better than CPUs: it's a lot of floating point math, and GPUs eat that up. Which means when you're trying to figure out if you can fit a model on your machine, the real question is how much you can fit into VRAM.
Up until a year or two ago, pretty much every model you used was what we call a "dense" model. Dense means every single parameter in the model gets activated for every token it produces. A 70b dense model is doing 70b worth of math, every single token.
Then Mixture of Experts (MoE) models started taking off. You'll see them named like Qwen3.5-397b-a17b or Qwen3.6-35b-a3b. The "a" stands for "active parameters". The way MoE works is that the model is split up into a bunch of smaller "experts", and for each token, a "router" picks just a few of those experts to use. So Qwen3.5-397b-a17b has 397 billion total parameters, but only 17 billion get used for any given token.
What this means in practice: an MoE model runs at roughly the speed of its active parameter count, not its total. So Qwen3.5-397b-a17b runs only a little slower than the speed of a 17b dense model, even though it has 397b worth of parameters.
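Here's a toy sketch of that routing step. Everything here is made up for illustration: the "router" is a seeded random number generator standing in for what, in a real model, is a small learned layer that scores experts from the token's hidden state.

```python
# Toy MoE routing sketch: score every expert for the current token,
# run only the top-k, and weight their outputs.
import math
import random

NUM_EXPERTS = 8   # total experts in the layer
TOP_K = 2         # experts that actually do math per token

def route(token_id):
    rng = random.Random(token_id)  # fake, repeatable "router scores"
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    # only the top-k scoring experts run for this token
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    # normalize the winners' scores into weights for combining their outputs
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

print(route(42))  # two (expert_id, weight) pairs; weights sum to 1
```

The point of the sketch: per token, only 2 of the 8 experts cost you any compute, which is exactly why a 397b-a17b model can run near 17b-dense speeds.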
That's a big deal for performance, especially on local hardware. It really made those of us who invested in Macs early very happy. I almost, ALMOST, started to regret my first Mac Studio back in 2023... then not long after Mixtral 8x7B came out and that changed everything. It's only gotten better since.
The tradeoff with MoEs is on the knowledge side. An MoE with 397b total isn't as smart as a dense 397b model would be; the smarts land somewhere between the active count and the total count. Where exactly is debated and varies by model, but the rule of thumb is to expect noticeably better than a dense model at the active size, and nowhere near a dense model at the total size. So Qwen3.6-35b-a3b isn't going to behave like a 35b dense; it'll feel like something north of a 3b but well short of a 35b.
The other catch, and this one matters a lot if you're running locally, is that even though MoE only uses a fraction of params per token, you still have to load ALL the params into memory. That 397b model still needs somewhere around 200GB at q4 to run, even though only 17b worth is doing math at any given moment. Llama.cpp does have a clever way to offload the inactive expert layers to system RAM so you can run these things on regular gaming hardware, but that's a deeper topic. I have a whole writeup on MoE offloading if you want to go down that rabbit hole.
Training
LLMs learn by being "trained". It's a complex process that, at the absolute highest level, involves the LLM seeing billions upon billions of tokens of information and learning patterns from it. "When I see someone say this, it usually involves someone responding with that" kind of thing. This is why people constantly harp about good data in training being the most important thing- if you have really clean examples of speech, knowledge, etc, it is easier for the LLM to find the right patterns.
Eventually, more powerful LLMs start to infer new patterns that they haven't seen before. Remember the old math problems like if A == B and B == C, then A == C? Imagine that on a MASSIVE scale, where it creates connections between information many many many many layers deep to get from A to Z.
- Training a commercially viable model takes ungodly amounts of money and data, and you need really smart people to do it. Companies spend millions to billions of dollars making some of the most powerful models.
- Training data is hard to come by. If you've heard about how some companies scraped the internet for data? That's why. They are looking for examples of speech, knowledge, etc. When a company wants to train on your data, it's less that they want to include your personal PII in the model (they generally don't; they don't want the bad publicity if someone makes the model spit it out) and more that they want nice, clean interactions to give the LLM to look at and learn more patterns from.
- This is also why AI companies are mad at each other for "distilling" their products. Distilling is the act of interacting with an LLM over and over to get examples of its speaking or thinking process, then turning those interactions into training data to teach another LLM to act or reason the same way. A recent example: Anthropic accused DeepSeek, Moonshot AI, and MiniMax of doing this. The accusation was that they used thousands of fraudulent accounts to interact with Claude millions of times, then used those interactions to teach their own models to think and speak similarly.
- It's possible to train little fun models pretty cheaply. One guy recently trained a small model from scratch on 1800s text, with nothing at all modern in it. This little model has no concept of anything past the industrial age.
Finetuning / Post-Training
When you hear a non-tech company say they are "training a model", they most likely mean finetuning or post-training an open weight model.
Imagine an LLM as a big calculator for matrix math. Numbers go in, one number comes out. Do that over and over and you get a response. The neat thing about matrix math is something called rank factorization: the idea that you can represent an m*n matrix with rank r using two smaller matrices, m*r and r*n. Some super smart folks figured out that this allows us to have LoRAs, which you can think of as add-on components to LLMs that modify the weight distribution.
In other words- rather than retraining the entire model to try to add more information, you train an itty bitty version of that model with the info you want, and then you can load the original model + LoRA at the same time to get a post-trained model.
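The parameter math is what makes this so cheap. A quick sketch below, using a made-up-but-typical layer shape and rank; it only counts parameters, it doesn't implement a LoRA:

```python
# Why low-rank adapters are so cheap: compare the parameter count of a
# full m*n weight update vs. the LoRA factorization (m*r + r*n).

def full_update_params(m, n):
    return m * n          # updating the whole weight matrix

def lora_params(m, n, r):
    return m * r + r * n  # the two small factor matrices instead

m, n, r = 4096, 4096, 16  # illustrative layer shape and a small rank
full = full_update_params(m, n)
lora = lora_params(m, n, r)
print("full: %d, lora: %d, ~%dx smaller" % (full, lora, full // lora))
# full: 16777216, lora: 131072, ~128x smaller
```

That ~128x saving per layer is why you can post-train on a single consumer GPU instead of a cluster.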
Truthfully- I am pretty staunchly in the camp that you can't reliably train new knowledge into a model this way. That's a very common but not a universal view within the deeper LLM tinkering community; some companies have made post-training their bread and butter. I do believe that you CAN train styles, tones, etc really well into it (for example: training a model to handle documentation a certain way, or think a certain way), but ultimately I've yet to see a good example of a post-trained model outside of basic Instruct models from the same manufacturer that has actually been worth the effort. Maybe there are some out there, but I'm not familiar with them.
Anyhow, long story short- you CAN post-train a small model for $100 or less, but I wouldn't even recommend it unless you really understand what you want to get out of it and why. There's very little a post-trained model can do that you can't do with a good workflow, prompt and data to RAG against.
How LLMs Respond
When you boil it down, LLMs work in a really simple loop. You give it a chunk of tokens. It processes them and spits out one new token. Then it takes all your original tokens plus that one new token it just spit out, and processes the whole thing again, and spits out the next token. Then it takes all your tokens plus the two new tokens, processes again, spits out the next. On and on, one token at a time, until it decides it is done and sends a stop token. You now have your response.
To simplify it- LLMs don't think about the response all at once- they think 1 token at a time. Over and over and over until they are done. That's it.
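That loop fits in a few lines. In this sketch, next_token() is a stand-in lookup table rather than a real model; the shape of the loop is the point:

```python
# The generation loop in miniature: predict one token, append it to the
# context, repeat until a stop token comes out.

STOP = "<eos>"
BIGRAMS = {"Hello": "world", "world": "!", "!": STOP}  # fake "model"

def next_token(context):
    # a real LLM runs the ENTIRE context through the network here;
    # this toy just looks at the last token
    return BIGRAMS.get(context[-1], STOP)

def generate(prompt_tokens):
    context = list(prompt_tokens)
    while True:
        tok = next_token(context)
        if tok == STOP:
            break
        context.append(tok)  # the new token becomes part of the input
    return context

print(generate(["Hello"]))  # ['Hello', 'world', '!']
```

Notice that every new token gets fed back in before the next one is predicted; that feedback is what makes reasoning (below) work.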
This is also why "reasoning" works. If you ask a model to just answer a hard math problem cold, it can fumble it, because by the time it gets to the answer it's already locked into early tokens it picked. But if you tell it to think out loud first- write out the problem, work through it step by step- then while it's writing all that, it's still just predicting one token at a time, except now each new token gets to "see" all the work it just laid out. If it makes a mistake at step 2, it can sometimes catch it at step 4 and shift the line of thinking before it commits to a final answer.
If you ever watch an LLM think and it constantly goes "But wait...", that's because it was trained to do that to keep itself from locking in. It says its response, then it challenges the response, and in doing so it gives itself a chance to realize the response was wrong.
That's basically what chain of thought and reasoning models are: the model writes out its work so it has more to reference when generating each next token. It's not magic; it's just giving the model more useful context to predict from. The flip side is that more reasoning means more tokens, which means more time and more cost. And some models, like Qwen3.5/3.6 and Gemma 4, overthink badly. With those, you want to use a workflow app to manually apply CoT if you can. Since I use Wilmer everywhere, I have workflows specifically to use Qwen/Gemma with thinking disabled, followed by a manual CoT step. That helps with overthinking massively.
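A manual CoT step is really just two prompts instead of one. Here's a minimal sketch of the pattern; llm() is a hypothetical stand-in for whatever API or local endpoint you call, and the prompt wording is just one way to phrase it:

```python
# Manual chain-of-thought: ask for the worked steps first, then ask for
# the final answer with those steps already in context.

def llm(prompt):
    # hypothetical stand-in for your actual model call
    return "(model output for: %s...)" % prompt[:30]

def answer_with_manual_cot(question):
    reasoning = llm(
        "Work through this step by step. Do NOT give a final answer yet.\n\n"
        + question
    )
    final = llm(
        "Question: %s\n\nWorked steps:\n%s\n\n"
        "Using ONLY the steps above, state the final answer."
        % (question, reasoning)
    )
    return final

print(answer_with_manual_cot("What is 2 + 2?"))
```

Because you control both prompts, you decide how much "thinking" happens instead of letting an overthinking model ramble for thousands of tokens.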
RAG - Retrieval Augmented Generation
This is a $5 term for a $0.05 concept. When we talk about RAG, it boils down to a very simple concept: give the LLM the answer before it responds. Everything else, when talking about RAG, is talking about a design pattern.
- Simplest example: The simplest form of RAG would be copying the text of an article or tutorial, putting it in your prompt, and asking the LLM to answer a question about that. The LLM will use the article to answer you.
- Next level of simplicity: You might ask an LLM a question, the LLM uses a tool (web search, local wiki search, whatever) to pull the article, concatenates it into your prompt, and answers your question.
- What a lot of folks think of when they think of RAG: You have a program that takes thousands, or even millions, of documents and turns them into "embeddings"- ie, it breaks each document into logical chunks and stores them somewhere easy to retrieve from, such as a vector database. Then, when you ask a question, it does some fancy stuff in the background to find the right chunks and answer your question with them. Since putting 1,000,000 files into your context all at once is impossible, this is how you get the oft-advertised "chat with your documents" situation.
But all together, RAG comes down to a very simple concept: give the LLM the answer before it responds. That's it. LLMs are very, very strong at this, and it's a great way to avoid hallucinations.
For the most part, RAG solutions are not an LLM problem, they're a software problem. If you're struggling with RAG, you probably need to revisit HOW you're feeding the data to your LLM and whether you're giving it too much unnecessary stuff along with the right stuff.
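To show just how unmagical the core idea is, here's a bare-bones retrieval sketch. Real systems use embeddings and a vector database instead of this naive word-overlap scoring, and the sample chunks are made up, but the shape (score chunks, take the best, stuff them into the prompt) is the same:

```python
# Bare-bones RAG: score stored chunks against the question, then put
# the best one into the prompt so the model has the answer in front of it.

CHUNKS = [
    "The first Ford car, the Model A, was sold in 1903.",
    "Tokenizers split text into tokens before the model sees it.",
    "A context window is how much text the model can see at once.",
]

def score(chunk, question):
    # naive relevance: how many words the chunk shares with the question
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def build_prompt(question, top_n=1):
    best = sorted(CHUNKS, key=lambda c: score(c, question), reverse=True)[:top_n]
    return "Use this context to answer:\n%s\n\nQuestion: %s" % (
        "\n".join(best), question)

print(build_prompt("When was the first Ford car sold?"))
```

By the time the LLM sees the prompt, the answer is already sitting in it; the model just has to read it back to you, which is the thing LLMs are best at.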
Hallucinations
A hallucination is when the LLM responds with something that's flat wrong. The reason it happens comes back to that loop in the How LLMs Respond section: an LLM doesn't actually know anything. It's a pattern matcher predicting the most likely next token, based on training that boiled down to "when I see X, I usually see a response of Y". If the most likely next token happens to be the wrong one, well, that's what you get. This especially happens with information that doesn't have a lot of great data out there, where the LLM had to infer the relationships. Ask a detailed question about Excel and it has millions of example questions, articles, and documents from the internet to have learned from; ask about FIS' Relius Administration and there are far, far fewer examples, so it likely inferred a lot based on other patterns, and it will hallucinate like mad.
LLMs, as a technology, don't have a built-in "I'm not sure about this" lever they can pull. A model just generates whatever the patterns say to generate, and confidence isn't really part of the equation. The answer it gave you is 'right' in the sense that it generated the most likely pattern. Whether that pattern is of any use to you has nothing to do with the LLM lol.
The most common reasons you see hallucinations:
- The training data was wrong, so the pattern the model learned is wrong.
- The training data didn't cover the topic well, so the model is filling in gaps with whatever sounds plausible.
- You asked something outside what the model was really trained for, and it tries to answer anyway because that's what it was trained to do- give an answer.
- Your context window is huge or messy, and the model is losing track of what's actually relevant in there.
- The model is over-quantized and just making more mistakes generally (going back to that earlier section).
Reasoning models hallucinate a bit less on certain types of problems because they get a chance to second-guess themselves while writing things out, but they absolutely still hallucinate. The single best mitigation is to put the answer in the context for it, which is RAG.
Using That Info
Knowing all this should hopefully help you start to narrow down why some of the "pro tips" of using LLMs exist. When you want a factual answer, you don't just ask the LLM. Right or wrong, you're getting a confident response. Instead, make sure you are injecting the right answer in before it responds- this often means tool use such as web search or, even better, "Deep Research" features you find on commercial LLMs.
This also hopefully helps you see why jamming ALL your codebase into the LLM, or constantly asking "what model has a bigger context window?", is the wrong approach. It's lazy to just look for bigger context windows, and that laziness will bite you. Instead, focus on how you can break the data apart so the LLM can work within the confines of what it handles best. That means writing or downloading some supporting software.
Anyhow, good luck folks. Hope this helps the like 4 people that might read this far.