Twitter/X...
Somehow I've made it this far in life without ever actually using the site. But while I wait for one of my tickets with Reddit to finally reach a human so that I can get my account back, outside of Discord it's one of the few places I can
So my old Reddit post about my "unorthodox setup" went down with the Reddit ship, and I figured it was time for an update anyway, so I'm bringing it back. My setup has gotten more complex than I originally planned, built out piecemeal over the past 2.
Ok, for anyone else RDPing into a Windows machine from a Mac who's experiencing latency between sound and visuals, especially when watching a video: I just went into Settings and set "Graphics Interpolation Level" under "General" to `Medium`, and it had an immediately noticeable
Everyone is saying that the new iPhone isn't much, but the fact that they added dedicated MatMul acceleration into the A19 is huge, because it means we'll probably see it in the M5. For folks like me, that's a dream come true. I love my
A quick dump of the benchmarks that I look at and use personally; I've dropped a few that no longer appear to be kept up to date, and grabbed a few newer ones.

Code Specific
* https://www.swebench.com/
* https://swe-rebench.com/
* https://aider.chat/docs/leaderboards/

Coding
RAG is really 90% a software development problem, 10% an AI problem. People overcomplicate it on the AI side a lot, but it's a $5 term for a $0.05 concept: give the LLM the answer before it responds to you. On its face, that's simple
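The whole trick, in code: if ordinary search can already find the right text, the AI part is one API call at the end. A minimal sketch of the idea, assuming the OpenAI Python client and a hypothetical `search_docs` retriever; everything before the final call is plain software:

```python
# Minimal RAG loop: find the answer first, then let the LLM phrase it.
from openai import OpenAI

client = OpenAI()

def search_docs(question: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: return the k most relevant text chunks.

    This is the 90% software problem: SQL, BM25, a vector store,
    whatever actually finds the right text for your data.
    """
    raise NotImplementedError("plug in your own search here")

def answer(question: str) -> str:
    # Retrieve the text that contains the answer...
    context = "\n\n".join(search_docs(question))
    # ...then hand it to the model before it responds (the 10% AI problem).
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```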
So over the past week I'm suddenly seeing folks posting on LinkedIn about this stat that 95% of companies fail to generate significant revenue with AI projects. Honestly, I'd believe it. Personally? I think a big reason so many AI projects die is that folks
M3 Ultra Mac Studio 512GB Speeds

Qwen3 235b a22b Instruct Q8 in Llama.cpp server (~15k tokens)

prompt eval time: 4.60 ms per token, 217.29 tokens per second
eval time: 67.59 ms per token, 14.80 tokens per second
total time: 146863.82 ms / 15763 tokens

(~5k
So back in July, while using a fairly popular commercial VPN, I made a comment on the LocalLlama sub answering someone's question by linking one of my own posts with some benchmarks; something I would do often. After a few minutes, I decided to edit the post to
Found this screenshot from back in 2023; if I remember right, CodeLlama had just come out and I was trying to see how it would do in a coding interview. But look at that old interface for text-gen. The below picture is from the internet, and shows the new interface.
Mac Model: M3 Ultra Mac Studio 512GB, 80-core GPU

First, this model has a shockingly small KV cache. If any of you saw my post about running Deepseek V3 q4_K_M, you'd have seen that the KV cache buffer in llama.cpp/koboldcpp was 157GB for
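For a rough sense of why KV cache sizes vary so wildly between models, here's the standard back-of-the-envelope estimate; this is my own sketch with made-up example dimensions, not numbers from the post:

```python
# Approximate KV cache size for a standard transformer:
# 2 tensors (K and V) per layer, each n_kv_heads * head_dim wide,
# one entry per context token. GQA models with few KV heads are
# why some very large models still have surprisingly small caches.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative dimensions only (fp16 cache, 32k context):
print(kv_cache_bytes(64, 8, 128, 32768) / 1024**3, "GiB")  # 8.0 GiB
```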
UPDATE 2025-04-13: llama.cpp has had an update that GREATLY improved the prompt processing speed. Please see the new speeds below.

Deepseek V3 0324 Q4_K_M w/Flash Attention
4800 token context, responding 552 tokens

CtxLimit:4744/8192, Amt:552/4000, Init:0.07s, Process:65.46s (64.02T/
Benchmarks
Below are benchmarks of running Llama 3.1 405b q6 and Command A 111b Q8 on an M3 Ultra 512GB using KoboldCpp. The 405b was so miserable to run that I didn't even try flash attention, and flash attention was completely broken with Command-A.

M3 Ultra Llama 3.
Benchmarks
tl;dr: Running ggufs in Koboldcpp, the M3 is marginally... slower? Slightly faster prompt processing, but slower response writing across all models. I added a comparison Llama.cpp run at the bottom; same speed as Kobold, give or take.

Setup:
* Inference engine: Koboldcpp 1.85.1
* Text: Same text on
Benchmarks
It's been a while since my last Mac speed post, so I figured it was about time to post a new one. I've noticed a lot of the old "I get 500 tokens per second!" kind of talk reappearing, so I figured some cold, hard
A quick introduction before I begin. If you haven't had an opportunity to read it yet, please check out the first post: My personal guide for developing software with AI Assistance. This will not rehash that information, but is rather an addendum to it with new things that
Offline-Wiki-Api
Cross-Posting from Reddit

This project is an answer to a previous question that I had about the easiest route to offline Wikipedia RAG. After mulling over the responses, txtai jumped out to me as the most straightforward. Since by default that dataset only returns the first paragraph of the
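For anyone who wants to try the same thing, here's roughly what querying txtai's prebuilt Wikipedia index looks like; a sketch assuming the `neuml/txtai-wikipedia` index on the Hugging Face Hub (check the txtai docs for your version's exact load API):

```python
# Query a prebuilt Wikipedia embeddings index with txtai.
from txtai import Embeddings

embeddings = Embeddings()
# Pull the prebuilt index down from the Hugging Face Hub.
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Each hit carries only the article's first paragraph by default,
# which is the limitation this project works around.
for result in embeddings.search("history of the transistor", 3):
    print(result["id"], round(result["score"], 3))
    print(result["text"][:200], "\n")
```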
This post is a combination of some new results, old results, and reddit.com/u/invectorgator's results (with permission) to help give a clear picture of all testing so far. Links to the relevant posts can be found below. This was a lot of fun, and has lit
WilmerAI
Cross-Posting from Reddit

IMPORTANT: This is an early development release, barely even an alpha. Wilmer is a passion project for myself, but it felt stingy not to share it given how interested everyone was in it, so I released early. It's still months from what I'd consider
So, in the past I've mentioned that I use AI to assist in writing code for my personal projects, especially for things I use to automate stuff for myself, and I've gotten pretty mixed responses. Some folks say they do the same, others say AI can
WilmerAI
Ever since I first saw the group chat feature in SillyTavern, I've always wanted to have a team of AIs to help me work on things. But I never liked the result of using one LLM to do it; it never really felt like it was doing me
For the past few months I've been working on a quiet little project on the weekends, whenever I can scrounge up time, and part of that project involves looking for the best models for each domain. Of course, there are some great coding, medical, math, etc. finetunes, but
So after seeing a lot of folks recommending SillyTavern as a good front end for APIs, I finally decided to give it a proper try. I've mostly been using Oobabooga, and while I had ST installed from many months ago, I never put a lot of time into understanding
Benchmarks
Previous Post: Here are some real-world speeds for the Mac M2

Introduction

In my previous post, I showed the raw real-world numbers of what non-cached response times would look like for a Mac Studio M2 Ultra. My goal was to demonstrate how well the machine really handles models at full