GLM 4.6 MXFP4 vs q8_0 gguf speeds on Mac M3 Ultra

Someone asked me to run the mxfp4 gguf vs q8, so I figured I'd post the results here too for anyone to see.

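As expected, mxfp4 comes out to a little over half the size of the q8, and it's roughly 20% faster: prompt processing went from 120.70 to 146.83 tokens per second, and generation from 14.28 to 17.20 tokens per second. I expect if we see this quant on MLX, we'll see even higher speeds.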

I tested out a couple of responses and they looked pretty good for mxfp4. With that said, I'll personally be sticking to the q8_0 for now.


Q8_0 gguf @ 65535 Context

load_tensors: Metal_Mapped model buffer size = 37190.20 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 22520.34 MiB
load_tensors:   CPU_Mapped model buffer size =   786.25 MiB

...

llama_kv_cache:      Metal KV buffer size = 23552.00 MiB

Total Size: 357.6GB Metal VRAM buffer + 786MB CPU buffer + 23.5GB KV Cache == 381.8GB
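
If you want to double check the size math, here's a minimal sketch that adds up the buffer sizes from the Q8_0 log lines above. The regex and the `sum_buffers_mib` helper are just my own illustration based on the lines shown here, not anything that ships with llama.cpp. Note that the log reports MiB, and the "GB" figures in this post look like MiB divided by 1000 with each part cut to one decimal, so the script's total will differ slightly from the hand-computed one.

```python
import re

# Log lines copied from the Q8_0 run above.
LOG = """\
load_tensors: Metal_Mapped model buffer size = 37190.20 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 22520.34 MiB
load_tensors:   CPU_Mapped model buffer size =   786.25 MiB
llama_kv_cache:      Metal KV buffer size = 23552.00 MiB
"""

# Assumed line shape: "<label> buffer size = <number> MiB"
BUFFER_RE = re.compile(
    r"^(?P<label>.+?) buffer size\s*=\s*(?P<mib>[\d.]+) MiB", re.M
)


def sum_buffers_mib(log_text):
    """Return total MiB per buffer label found in the log text."""
    totals = {}
    for match in BUFFER_RE.finditer(log_text):
        # Collapse the padding spaces llama.cpp uses to align columns.
        label = " ".join(match.group("label").split())
        totals[label] = totals.get(label, 0.0) + float(match.group("mib"))
    return totals


totals = sum_buffers_mib(LOG)
grand_total = sum(totals.values())

for label, mib in totals.items():
    print(f"{label}: {mib:.2f} MiB")
print(f"total: {grand_total:.2f} MiB ({grand_total / 1024:.1f} GiB)")
```

Swapping in the log lines from another run gives the same breakdown for that quant.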

prompt eval time 
    19593.48 ms /  2365 tokens 
    8.28 ms per token   
    120.70 tokens per second
eval time
    99184.67 ms /  1416 tokens 
    70.05 ms per token    
    14.28 tokens per second
      
total time 
    118778.15 ms (1 minute 58 seconds)
    3781 tokens

MXFP4 gguf @ 65535 Context

load_tensors: Metal_Mapped model buffer size = 38020.91 MiB
load_tensors: Metal_Mapped model buffer size = 38045.00 MiB
load_tensors: Metal_Mapped model buffer size = 38048.12 MiB
load_tensors: Metal_Mapped model buffer size = 38114.15 MiB
load_tensors: Metal_Mapped model buffer size = 35252.64 MiB
load_tensors:   CPU_Mapped model buffer size =   786.25 MiB

...

llama_kv_cache:      Metal KV buffer size = 23552.00 MiB

Total Size: 187.4GB Metal VRAM buffer + 786MB CPU buffer + 23.5GB KV Cache == 211.6GB

prompt eval time 
    16107.04 ms /  2365 tokens 
    6.81 ms per token   
    146.83 tokens per second
eval time
    57142.26 ms /   983 tokens 
    58.13 ms per token    
    17.20 tokens per second

total time
    73249.30 ms (1 minute 13 seconds)
    3348 tokens
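
For anyone who wants to compare the two runs directly, here's a rough sketch that recomputes tokens per second from the raw millisecond and token counts above and prints the relative speedup. The `Phase` class and `speedup` function are hypothetical helpers I made up for this; only the numbers come from the logs. Keep in mind the two runs generated different amounts of text (1416 vs 983 tokens), so the eval comparison isn't over identical output lengths.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One timing line from the log: elapsed milliseconds and token count."""
    ms: float
    tokens: int

    @property
    def tokens_per_second(self):
        return self.tokens / (self.ms / 1000.0)

    @property
    def ms_per_token(self):
        return self.ms / self.tokens


# Numbers copied from the two runs above.
q8 = {
    "prompt eval": Phase(ms=19593.48, tokens=2365),
    "eval": Phase(ms=99184.67, tokens=1416),
}
mxfp4 = {
    "prompt eval": Phase(ms=16107.04, tokens=2365),
    "eval": Phase(ms=57142.26, tokens=983),
}


def speedup(base, other):
    """Ratio of throughput: values above 1.0 mean `other` is faster."""
    return other.tokens_per_second / base.tokens_per_second


for name in q8:
    base, other = q8[name], mxfp4[name]
    gain = (speedup(base, other) - 1.0) * 100.0
    print(
        f"{name}: {base.tokens_per_second:.2f} -> "
        f"{other.tokens_per_second:.2f} tok/s ({gain:.1f}% faster)"
    )
```

Comparing throughput rather than total time avoids penalizing the run that happened to produce a longer response.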