GLM 4.6 MXFP4 vs q8_0 gguf speeds on Mac M3 Ultra

Someone asked me to run the mxfp4 gguf vs q8, so I figured I'd post the results here too for anyone to see.

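As expected, mxfp4 comes out to a little over half the size of the q8, and it's a bit faster: roughly 22% on prompt processing (146.83 vs 120.70 tokens per second) and roughly 20% on generation (17.20 vs 14.28 tokens per second). I expect if we see this quant on MLX, we'll see even higher speeds.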
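For anyone wondering why "a little over half" is the expected size, here's a rough sketch of the bits-per-weight math. The block layouts are my understanding of how llama.cpp stores these two formats (32 weights per block, with a 2-byte fp16 scale for q8_0 and a 1-byte shared exponent for mxfp4), so treat the constants as assumptions rather than gospel.

```python
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    """Average storage cost per weight for a block-quantized format."""
    return block_bytes * 8 / weights_per_block


# Assumed block layouts (32 weights per block in both cases).
WEIGHTS_PER_BLOCK = 32
Q8_0_BLOCK_BYTES = 2 + 32    # fp16 scale + 32 one-byte values
MXFP4_BLOCK_BYTES = 1 + 16   # 8-bit shared exponent + 32 four-bit values packed in 16 bytes

q8_bpw = bits_per_weight(Q8_0_BLOCK_BYTES, WEIGHTS_PER_BLOCK)
mxfp4_bpw = bits_per_weight(MXFP4_BLOCK_BYTES, WEIGHTS_PER_BLOCK)

print(f"q8_0:  {q8_bpw} bits per weight")
print(f"mxfp4: {mxfp4_bpw} bits per weight")
print(f"ratio: {mxfp4_bpw / q8_bpw}")
```

Under those assumptions the quantized tensors themselves come out at exactly half. The "little over" part is presumably whatever tensors the quant leaves at higher precision, but I haven't dug through the file to confirm which ones.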

I tested out a couple of responses and they looked pretty good for mxfp4. With that said, I'll personally be sticking to the q8_0 for now.

Q8_0 gguf @ 65535 Context

load_tensors: Metal_Mapped model buffer size = 37190.20 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 22520.34 MiB
load_tensors:   CPU_Mapped model buffer size =   786.25 MiB

...

llama_kv_cache:      Metal KV buffer size = 23552.00 MiB

Total Size: 357.6GB Metal VRAM buffer + 786MB CPU buffer + 23.5GB KV Cache == 381.8GB
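If you want to total these up for your own runs without doing it by hand, here's a small sketch that pulls the buffer sizes out of the log lines above and sums them by buffer type. The regex and the `sum_buffers` helper are just something I'm making up for illustration; they aren't part of llama.cpp, and they only cover the line shapes shown here.

```python
import re

# Buffer lines copied from the Q8_0 log above.
LOG = """\
load_tensors: Metal_Mapped model buffer size = 37190.20 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 22520.34 MiB
load_tensors:   CPU_Mapped model buffer size =   786.25 MiB
llama_kv_cache:      Metal KV buffer size = 23552.00 MiB
"""

# Matches "<source>: <buffer kind> buffer size = <number> MiB".
BUFFER_RE = re.compile(r"^(\w+):\s+(.*?) buffer size\s*=\s*([\d.]+) MiB$", re.M)


def sum_buffers(log: str) -> dict[str, float]:
    """Sum the reported sizes (in MiB) per buffer kind."""
    totals: dict[str, float] = {}
    for _source, kind, size in BUFFER_RE.findall(log):
        totals[kind] = totals.get(kind, 0.0) + float(size)
    return totals


totals = sum_buffers(LOG)
total_mib = sum(totals.values())

for kind, mib in totals.items():
    print(f"{kind:>20}: {mib:12.2f} MiB")
print(f"{'total':>20}: {total_mib:12.2f} MiB ({total_mib / 1024:.1f} GiB)")
```

One thing to keep in mind: llama.cpp reports MiB, and the GB figures in my total line are those numbers divided by 1000 with some light rounding, so the script's total lands within a fraction of a GB of mine rather than matching it digit for digit.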

prompt eval time 
    19593.48 ms /  2365 tokens 
    8.28 ms per token   
    120.70 tokens per second
eval time
    99184.67 ms /  1416 tokens 
    70.05 ms per token    
    14.28 tokens per second
      
total time 
    118778.15 ms (1 minute 58 seconds)
    3781 tokens

MXFP4 gguf

load_tensors: Metal_Mapped model buffer size = 38020.91 MiB
load_tensors: Metal_Mapped model buffer size = 38045.00 MiB
load_tensors: Metal_Mapped model buffer size = 38048.12 MiB
load_tensors: Metal_Mapped model buffer size = 38114.15 MiB
load_tensors: Metal_Mapped model buffer size = 35252.64 MiB
load_tensors:   CPU_Mapped model buffer size =   786.25 MiB

...

llama_kv_cache:      Metal KV buffer size = 23552.00 MiB

Total Size: 187.5GB Metal VRAM buffer + 786MB CPU buffer + 23.5GB KV Cache == 211.8GB

prompt eval time 
    16107.04 ms /  2365 tokens 
    6.81 ms per token   
    146.83 tokens per second
eval time
    57142.26 ms /   983 tokens 
    58.13 ms per token    
    17.20 tokens per second

total time
    73249.30 ms (1 minute 13 seconds)
    3348 tokens
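Since the two runs generated different numbers of tokens (1416 vs 983), the total times aren't directly comparable; the per-token rates are. Here's a quick sketch of how those rates and the relative speedup fall out of the raw timings. The `RUNS` table and `tokens_per_second` helper are just names I picked for this example.

```python
# (elapsed milliseconds, tokens) for each phase, copied from the logs above.
RUNS = {
    "q8_0": {"prompt": (19593.48, 2365), "eval": (99184.67, 1416)},
    "mxfp4": {"prompt": (16107.04, 2365), "eval": (57142.26, 983)},
}


def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Throughput for one phase: tokens divided by elapsed seconds."""
    return tokens / (elapsed_ms / 1000.0)


rates = {
    name: {phase: tokens_per_second(*timing) for phase, timing in phases.items()}
    for name, phases in RUNS.items()
}

for phase in ("prompt", "eval"):
    q8 = rates["q8_0"][phase]
    mx = rates["mxfp4"][phase]
    print(f"{phase:>6}: q8_0 {q8:7.2f} t/s | mxfp4 {mx:7.2f} t/s | speedup {mx / q8:.2f}x")
```

Both phases come out at roughly 1.2x in favor of mxfp4. This is a single run of each on one prompt, so I wouldn't read too much into the second decimal place.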
