RAG Really Is More of a Software Problem Than an AI Problem

RAG (retrieval-augmented generation) is really 90% a software development problem and 10% an AI problem. People overcomplicate the AI side a lot, but RAG is a $5 term for a $0.05 concept: give the LLM the answer before it responds to you.
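To make that concrete, here's a minimal sketch of the move in Python. It assumes the OpenAI Python client and a model name chosen purely for illustration; `context` is whatever your search layer dug up.

```python
# A minimal sketch of the core RAG move, under the assumptions above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, context: str) -> str:
    # The whole trick: put the answer material in front of the model
    # before it responds.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Everything interesting in RAG is about how `context` gets built, which is the rest of this post.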

On its face, that's simple enough, but too many folks take it literally and just reach for the model with the biggest context window they can find. A lot of the time, that's not really what you want.

Say you have 300,000+ tokens of data and an LLM that can handle 1M tokens. But then you look at long-context benchmarks, like FictionBench, and see that at that many tokens you'd be lucky to hit 60% accuracy, meaning 40% or more of the time it gives you a wrong answer.

Moving down the scale, though, is a different story. At 16k tokens the same model might be 90% accurate or better, maybe even 100%, meaning it almost always gives you the right answer.

That defines your problem statement for you. If you want to rely on that data, you need to hand the LLM the answer it needs inside the token range that gives you 90-100% accuracy. If you have 300k tokens of data, that means breaking it up into much smaller chunks so each request fits in something like 12-16k tokens.
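Here's a rough sketch of that chunking step, assuming the tiktoken library for token counting. The 2k-token chunk size is an illustrative parameter, picked so that a handful of chunks plus the question still lands comfortably inside a 12-16k budget, not a recommendation.

```python
# Naive fixed-size chunking by token count, under the assumptions above.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 2000) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    # Slice the token stream into fixed-size windows and decode each
    # window back into text.
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

A real chunker would split on paragraph or section boundaries and overlap adjacent chunks a bit so answers don't get cut in half, but the budget math works the same way.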

This is what makes RAG more of a software problem. Chunking your data, storing it, and building search mechanisms that find the right pieces to hand the LLM become the main work. And that's all plain ol' development, with a bit of AI flavor added in.
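As a sketch of how simple that search layer can start out, here's a naive keyword-overlap ranker. Real systems usually swap in embeddings or BM25, but the shape of the code stays the same: score every stored chunk against the question, return the best few.

```python
# A deliberately simple search layer: rank chunks by how many words
# they share with the question and keep the top k.
def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

Wire it up by joining the top chunks with blank lines and passing that as the `context` in the first sketch. Notice there's no AI anywhere in this function; it's the kind of code you could have written fifteen years ago.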

You can't overstate how important the tech surrounding an LLM is. At the end of the day, it will define your experience as much as, or more than, the model itself.