Twitter/X...
Somehow I've made it this far in life without ever actually using the site. But while I wait for one of my tickets with Reddit to finally reach a human so that I can get my account back, outside of Discord it's one of the few places I can
So my old Reddit post about my "unorthodox setup" went down with the Reddit ship, and I figured it was time for an update anyway, so I'm bringing it back. My setup has gotten more complex than I originally planned, built out piecemeal over the past 2.
Ok, for anyone else RDPing into a Windows machine from a Mac who's experiencing latency between sound and visuals, especially when watching a video: I just went into Settings and set "Graphics Interpolation Level" under "General" to `Medium`, and it had an immediately noticeable
Everyone is saying that the new iPhone isn't much, but the fact that they added dedicated MatMul acceleration into the A19 is huge, because it means we'll probably see it in the M5. For folks like me, that's a dream come true. I love my
A quick dump of the benchmarks that I look at and use personally; I've dropped a few that no longer appear to be kept up to date, and grabbed a few newer ones.

Code Specific
* https://www.swebench.com/
* https://swe-rebench.com/
* https://aider.chat/docs/leaderboards/

Coding
RAG is really 90% a software development problem, 10% an AI problem. People overcomplicate it on the AI side a lot, but it's a $5 term for a $0.05 concept: give the LLM the answer before it responds to you. On its face, that's simple
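The whole trick, in code: if ordinary search can already find the right text, the AI part is one API call at the end. A minimal sketch of the idea, assuming the OpenAI Python client and a hypothetical `search_docs` retriever; everything before the final call is plain software:

```python
# Minimal RAG loop: find the answer first, then let the LLM phrase it.
from openai import OpenAI

client = OpenAI()

def search_docs(question: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: return the k most relevant text chunks.

    This is the 90% software problem: SQL, BM25, a vector store,
    whatever actually finds the right text for your data.
    """
    raise NotImplementedError("plug in your own search here")

def answer(question: str) -> str:
    # Retrieve the text that contains the answer...
    context = "\n\n".join(search_docs(question))
    # ...then hand it to the model before it responds (the 10% AI problem).
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```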
So over the past week I'm suddenly seeing folks posting on LinkedIn about this stat that 95% of companies fail to generate significant revenue with AI projects. Honestly, I'd believe it. Personally? I think a big reason so many AI projects die is that folks
M3 Ultra Mac Studio 512GB Speeds

Qwen3 235b a22b Instruct Q8 in Llama.cpp server (~15k tokens)

prompt eval time: 4.60 ms per token, 217.29 tokens per second
eval time: 67.59 ms per token, 14.80 tokens per second
total time: 146863.82 ms / 15763 tokens

(~5k
So back in July, while using a fairly popular commercial VPN, I made a comment on the LocalLlama sub answering someone's question by linking one of my own posts with some benchmarks; something I would do often. After a few minutes, I decided to edit the post to
Found this screenshot from back in 2023; if I remember right, CodeLlama had just come out and I was trying to see how it would do in a coding interview. But look at that old interface for text-gen. The below picture is from the internet, and shows the new interface.
Mac Model: M3 Ultra Mac Studio 512GB, 80-core GPU

First, this model has a shockingly small KV cache. If any of you saw my post about running Deepseek V3 q4_K_M, you'd have seen that the KV cache buffer in llama.cpp/koboldcpp was 157GB for
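For a rough sense of why KV cache sizes vary so wildly between models, here's the standard back-of-the-envelope estimate; this is my own sketch with made-up example dimensions, not numbers from the post:

```python
# Approximate KV cache size for a standard transformer:
# 2 tensors (K and V) per layer, each n_kv_heads * head_dim wide,
# one entry per context token. GQA models with few KV heads are
# why some very large models still have surprisingly small caches.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative dimensions only (fp16 cache, 32k context):
print(kv_cache_bytes(64, 8, 128, 32768) / 1024**3, "GiB")  # 8.0 GiB
```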
UPDATE 2025-04-13: llama.cpp has had an update that GREATLY improved the prompt processing speed. Please see the new speeds below.

Deepseek V3 0324 Q4_K_M w/Flash Attention
4800 token context, responding 552 tokens

CtxLimit:4744/8192, Amt:552/4000, Init:0.07s, Process:65.46s (64.02T/
Benchmarks
Below are benchmarks of running Llama 3.1 405b q6 and Command A 111b Q8 on an M3 Ultra 512GB using KoboldCpp. The 405b was so miserable to run that I didn't even try flash attention, and flash attention was completely broken with Command-A.

M3 Ultra Llama 3.
Benchmarks
tl;dr: Running ggufs in Koboldcpp, the M3 is marginally... slower? Slightly faster prompt processing, but slower response writing across all models. I added a comparison Llama.cpp run at the bottom; same speed as Kobold, give or take.

Setup:
* Inference engine: Koboldcpp 1.85.1
* Text: Same text on
Benchmarks
It's been a while since my last Mac speed post, so I figured it was about time to post a new one. I've noticed a lot of the old "I get 500 tokens per second!" kind of talk reappearing, so I figured some cold, hard
A quick introduction before I begin. If you haven't had an opportunity to read it yet, please check out the first post: My personal guide for developing software with AI Assistance. This will not rehash that information, but is rather an addendum to it with new things that
Offline-Wiki-Api
Cross-Posting from Reddit

This project is an answer to a previous question that I had about the easiest route to offline Wikipedia RAG. After mulling over the responses, txtai jumped out to me as the most straightforward. Since by default that dataset only returns the first paragraph of the
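For anyone who wants to try the same thing, here's roughly what querying txtai's prebuilt Wikipedia index looks like; a sketch assuming the `neuml/txtai-wikipedia` index on the Hugging Face Hub (check the txtai docs for your version's exact load API):

```python
# Query a prebuilt Wikipedia embeddings index with txtai.
from txtai import Embeddings

embeddings = Embeddings()
# Pull the prebuilt index down from the Hugging Face Hub.
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Each hit carries only the article's first paragraph by default,
# which is the limitation this project works around.
for result in embeddings.search("history of the transistor", 3):
    print(result["id"], round(result["score"], 3))
    print(result["text"][:200], "\n")
```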
This post is a combination of some new results, old results, and reddit.com/u/invectorgator's results (with permission) to help give a clear picture of all testing so far. Links to the relevant posts can be found below. This was a lot of fun, and has lit
WilmerAI
Cross-Posting from Reddit

IMPORTANT: This is an early development release, barely even an alpha. Wilmer is a passion project for myself, but it felt stingy not to share it given how interested everyone was in it, so I released early. It's still months from what I'd consider
So, in the past I've mentioned that I use AI to assist in writing code for my personal projects, especially for things I use to automate stuff for myself, and I've gotten pretty mixed responses. Some folks say they do the same, others say AI can
WilmerAI
Ever since I first saw the group chat feature in SillyTavern, I've always wanted to have a team of AIs to help me work on things. But I never liked the result of using one LLM to do it; it never really felt like it was doing me
For the past few months I've been working on a quiet little project on the weekends, whenever I can scrounge up time, and part of that project involves looking for the best models for each domain. Of course, there are some great coding, medical, math, etc. finetunes, but
So after seeing a lot of folks recommending SillyTavern as a good front end for APIs, I finally decided to give it a proper try. I've mostly been using Oobabooga, and while I had ST installed from many months ago, I never put a lot of time into understanding
Benchmarks
Previous Post: Here are some real-world speeds for the Mac M2

Introduction

In my previous post, I showed the raw real-world numbers of what non-cached response times would look like for a Mac Studio M2 Ultra. My goal was to demonstrate how well the machine really handles models at full