GLM 4.6 MXFP4 vs q8_0 gguf speeds on Mac M3 Ultra
Someone asked me to run the mxfp4 gguf vs q8, so I figured I'd post the results here too for anyone to see.
As expected, mxfp4 comes out to a little over half the size of the q8, and it's about 20% faster in this run (146.83 vs 120.70 tokens per second on prompt processing, 17.20 vs 14.28 on generation). I expect if we see this quant on MLX, we'll see even higher speeds.
I tested out a couple of responses and they looked pretty good for mxfp4. With that said, I'll personally be sticking to the q8_0 for now.
Q8_0 gguf @ 65535 Context
load_tensors: Metal_Mapped model buffer size = 37190.20 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 22520.34 MiB
load_tensors: CPU_Mapped model buffer size = 786.25 MiB
...
llama_kv_cache: Metal KV buffer size = 23552.00 MiB
Total Size: 357.6GB Metal VRAM buffer + 786MB CPU buffer + 23.5GB KV Cache == 381.8GB
prompt eval time
19593.48 ms / 2365 tokens
8.28 ms per token
120.70 tokens per second
eval time
99184.67 ms / 1416 tokens
70.05 ms per token
14.28 tokens per second
total time
118778.15 ms (1 minute 58 seconds)
3781 tokens
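If anyone wants to redo the size math, here's a minimal sketch that sums the buffer lines from the Q8_0 load log. The function name `buffer_totals_mib` and the regex are just illustrative, not anything from llama.cpp itself, and the sketch assumes the "GB" figures in the total are MiB divided by 1000, which is what the numbers look like.

```python
import re

# Buffer lines copied from the Q8_0 load log in this post.
LOG = """\
load_tensors: Metal_Mapped model buffer size = 37190.20 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 37198.36 MiB
load_tensors: Metal_Mapped model buffer size = 37333.45 MiB
load_tensors: Metal_Mapped model buffer size = 37195.23 MiB
load_tensors: Metal_Mapped model buffer size = 22520.34 MiB
load_tensors: CPU_Mapped model buffer size = 786.25 MiB
llama_kv_cache: Metal KV buffer size = 23552.00 MiB
"""

# Hypothetical helper: group "<name> buffer size = <n> MiB" lines by buffer name.
LINE = re.compile(r"^\w+:\s+(.+?) buffer size\s*=\s*([\d.]+) MiB", re.MULTILINE)

def buffer_totals_mib(text):
    totals = {}
    for name, size in LINE.findall(text):
        totals[name] = totals.get(name, 0.0) + float(size)
    return totals

totals = buffer_totals_mib(LOG)
for name, mib in totals.items():
    # Show both MiB/1000 (assumed convention of the post) and true GiB.
    print(f"{name}: {mib:.2f} MiB = {mib / 1000:.1f} 'GB' = {mib / 1024:.1f} GiB")
print(f"total: {sum(totals.values()):.2f} MiB")
```

Pasting the MXFP4 buffer lines into `LOG` gives the same breakdown for the smaller quant.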
MXFP4 gguf @ 65535 Context
load_tensors: Metal_Mapped model buffer size = 38020.91 MiB
load_tensors: Metal_Mapped model buffer size = 38045.00 MiB
load_tensors: Metal_Mapped model buffer size = 38048.12 MiB
load_tensors: Metal_Mapped model buffer size = 38114.15 MiB
load_tensors: Metal_Mapped model buffer size = 35252.64 MiB
load_tensors: CPU_Mapped model buffer size = 786.25 MiB
...
llama_kv_cache: Metal KV buffer size = 23552.00 MiB
Total Size: 187.5GB Metal VRAM buffer + 786MB CPU buffer + 23.5GB KV Cache == 211.8GB
prompt eval time
16107.04 ms / 2365 tokens
6.81 ms per token
146.83 tokens per second
eval time
57142.26 ms / 983 tokens
58.13 ms per token
17.20 tokens per second
total time
73249.30 ms (1 minute 13 seconds)
3348 tokens
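For a quick side-by-side, here's a small sketch that recomputes tokens per second from the raw milliseconds and token counts above and prints the mxfp4/q8_0 ratio. The dictionary layout and function names are made up for illustration. Keep in mind the two runs generated different amounts of text (1416 vs 983 tokens), so the generation numbers aren't a perfectly controlled comparison.

```python
# Timings copied from the llama.cpp output in this post: (milliseconds, tokens).
runs = {
    "q8_0": {"prompt": (19593.48, 2365), "eval": (99184.67, 1416)},
    "mxfp4": {"prompt": (16107.04, 2365), "eval": (57142.26, 983)},
}

def tokens_per_second(ms, tokens):
    """Throughput for one phase: tokens divided by elapsed seconds."""
    return tokens / (ms / 1000.0)

def speedup(phase):
    """Ratio of mxfp4 throughput to q8_0 throughput for the given phase."""
    base = tokens_per_second(*runs["q8_0"][phase])
    new = tokens_per_second(*runs["mxfp4"][phase])
    return new / base

for phase in ("prompt", "eval"):
    q8 = tokens_per_second(*runs["q8_0"][phase])
    mx = tokens_per_second(*runs["mxfp4"][phase])
    print(f"{phase}: q8_0 {q8:.2f} t/s, mxfp4 {mx:.2f} t/s, ratio {speedup(phase):.2f}x")
```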