MMLU-Pro Combined Results - Model Quantization Comparison

This post combines some new results, my older results, and reddit.com/u/invectorgator's results (shared with permission) to give a clearer picture of all the testing so far. Links to the relevant posts can be found below.

This was a lot of fun, and has lit a fire under me about benchmarking. I have some ideas for a personal benchmarking tool using Wilmer that will be easier for me to run. Will share more info once I dig into it.

As usual, a few notes about the tests:

  • These tests were performed using u/chibop1's MMLU-Pro project. Be sure to swing by and thank them for giving us this fun toy.
  • With u/invectorgator's permission, this post combines all of our results together.
    • We both used the same commits of the MMLU-Pro project, only q8 GGUFs (unless otherwise specified), and Text-Generation-WebUI as our backend to guarantee correct prompt templating, so our test results are compatible (a rough sketch of this setup follows these notes).
  • I didn't run these tests expecting them to be a rigorously scientific assessment of an LLM's knowledge, and I understand the concerns people have about them. But they do test a combination of knowledge AND instruction following; they aren't perfect, but they're better than perplexity testing alone.
  • Invectorgator is covering Gemma, so I'm not.
  • Qwen 2 7b just really does not like this test, at least when running in Text-Generation-WebUI.
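For anyone who wants to picture what the harness is doing under the hood, here is a minimal sketch of the setup described above, not the actual MMLU-Pro project code. It assumes Text-Generation-WebUI was launched with --api so its OpenAI-compatible endpoint is listening on the default port 5000, and the question, options, and prompt wording are made up for illustration; the real harness handles few-shot prompting and answer extraction with more care.

```python
import re
import requests

# Assumption: Text-Generation-WebUI was started with --api, which exposes an
# OpenAI-compatible endpoint at this default address/port.
API_URL = "http://127.0.0.1:5000/v1/chat/completions"

# Hypothetical MMLU-Pro-style question (MMLU-Pro questions have up to 10 options, A-J).
question = "Which of the following numbers is prime?"
options = ["21", "33", "37", "49", "51", "57", "63", "77", "81", "91"]

option_block = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCDEFGHIJ", options))
prompt = (
    "Answer the following multiple-choice question. Think step by step, "
    'then finish with "The answer is (X)".\n\n'
    f"{question}\n{option_block}"
)

resp = requests.post(
    API_URL,
    json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.0,
    },
    timeout=300,
)
reply = resp.json()["choices"][0]["message"]["content"]

# Pull the final answer letter out of the response. Replies that never produce
# a parseable letter get scored as wrong, which is why these results measure
# instruction following as much as raw knowledge.
match = re.search(r"answer is \(?([A-J])\)?", reply, re.IGNORECASE)
print(match.group(1).upper() if match else "unparseable")
```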

New Models In This Test

This test adds the following new models to the pile; I went with some of my personal favorite fine-tunes. You can find the exact GGUFs I used below, and the linked posts list the exact GGUFs for the other models:

Old Posts Combined Into This One:

Key Takeaway

I am now convinced that Hermes 2 Theta Llama 3 8b is secretly a 30b in disguise. To say it is punching above its weight is an understatement.

All tests below were run on GGUFs (q8 unless otherwise noted) in Text-Generation-WebUI. The tests require more than 4096 tokens of context, so some model versions were chosen to meet that requirement.

Within each table, rows are loosely grouped by model size and family.
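For clarity on how to read the numbers: the Correct column is questions answered correctly out of the questions in that category, Score (%) is that fraction times 100, and the Totals table at the bottom sums the raw counts across all 14 categories (12032 questions) rather than averaging the per-category percentages. A quick sketch of that arithmetic, using rows that appear in the tables below (Hermes-2-Theta's Business and Law results):

```python
# Score (%) is simply correct / total * 100, rounded to two decimal places.
def score(correct: int, total: int) -> float:
    return round(correct / total * 100, 2)

# Single-category example (Hermes-2-Theta-Llama-3-8b, Business table below).
print(score(330, 789))    # 41.83

# The Totals table sums raw counts across categories instead of averaging
# the per-category percentages. Two of Hermes-2-Theta's rows as an example:
rows = [(330, 789), (280, 1101)]                 # (correct, total) for Business, Law
total_correct = sum(c for c, _ in rows)          # 610
total_questions = sum(t for _, t in rows)        # 1890
print(score(total_correct, total_questions))     # 32.28
```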

Business

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 277/789 | 35.11 |
| Open-Hermes-2.5-7b | 285/789 | 36.12 |
| Mistral-7b-Inst-v0.3-q8 | 265/789 | 33.59 |
| Llama-3-8b-q4_K_M | 148/789 | 18.76 |
| Llama-3-8b-q8 | 160/789 | 20.28 |
| Llama-3-8b-SPPO-Iter-3 | 247/789 | 31.31 |
| Hermes-2-Theta-Llama-3-8b | 330/789 | 41.83 |
| Yi-1.5-9b-32k-q8 | 240/789 | 30.42 |
| Phi-Medium-128k-q8 | 260/789 | 32.95 |
| Mixtral-8x7b-Instruct-Q8 | 310/789 | 39.29 |
| Dolphin-Mixtral-2.5-8x7b | 350/789 | 44.36 |
| Nous-Capybara-34b | 313/789 | 39.67 |
| Yi-1.5-34B-32K-Q8 | 325/789 | 41.19 |
| Command-R-v01-Q8 | 126/789 | 15.97 |
| Llama-3-70b-FP16-Q2_KXXS | 254/789 | 32.19 |
| Llama-3-70b-FP16-Q2_K | 309/789 | 39.16 |
| Llama-3-70b-FP16-Q4_K_M | 427/789 | 54.12 |
| Llama-3-70b-FP16-Q5_K_M | 415/789 | 52.60 |
| Llama-3-70b-FP16-Q6_K | 408/789 | 51.71 |
| Llama-3-70b-FP16-Q8_0 | 411/789 | 52.09 |

Law

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 282/1101 | 25.61 |
| Open-Hermes-2.5-7b | 260/1101 | 23.61 |
| Mistral-7b-Inst-v0.3-q8 | 248/1101 | 22.52 |
| Yi-1.5-9b-32k-q8 | 191/1101 | 17.35 |
| Phi-Medium-128k-q8 | 255/1101 | 23.16 |
| Llama-3-8b-q4_K_M | 161/1101 | 14.62 |
| Llama-3-8b-q8 | 172/1101 | 15.62 |
| Llama-3-8b-SPPO-Iter-3 | 200/1101 | 18.17 |
| Hermes-2-Theta-Llama-3-8b | 280/1101 | 25.43 |
| Mixtral-8x7b-Instruct-Q8 | 282/1101 | 25.61 |
| Dolphin-Mixtral-2.5-8x7b | 271/1101 | 24.61 |
| Nous-Capybara-34b | 369/1101 | 33.51 |
| Yi-1.5-34B-32K-Q8 | 417/1101 | 37.87 |
| Command-R-v01-Q8 | 146/1101 | 13.26 |
| Llama-3-70b-FP16-Q2_KXXS | 362/1101 | 32.88 |
| Llama-3-70b-FP16-Q2_K | 416/1101 | 37.78 |
| Llama-3-70b-FP16-Q4_K_M | 471/1101 | 42.78 |
| Llama-3-70b-FP16-Q5_K_M | 469/1101 | 42.60 |
| Llama-3-70b-FP16-Q6_K | 469/1101 | 42.60 |
| Llama-3-70b-FP16-Q8_0 | 464/1101 | 42.14 |

Psychology

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 430/798 | 53.88 |
| Open-Hermes-2.5-7b | 434/798 | 54.39 |
| Mistral-7b-Inst-v0.3-q8 | 343/798 | 42.98 |
| Llama-3-8b-q4_K_M | 328/798 | 41.10 |
| Llama-3-8b-q8 | 372/798 | 46.62 |
| Llama-3-8b-SPPO-Iter-3 | 252/798 | 31.58 |
| Hermes-2-Theta-Llama-3-8b | 452/798 | 56.64 |
| Yi-1.5-9b-32k-q8 | 173/798 | 21.68 |
| Phi-Medium-128k-q8 | 358/798 | 44.86 |
| Mixtral-8x7b-Instruct-Q8 | 365/798 | 45.74 |
| Dolphin-Mixtral-2.5-8x7b | 468/798 | 58.65 |
| Nous-Capybara-34b | 474/798 | 59.40 |
| Yi-1.5-34B-32K-Q8 | 510/798 | 63.91 |
| Command-R-v01-Q8 | 131/798 | 16.42 |
| Llama-3-70b-FP16-Q2_KXXS | 493/798 | 61.78 |
| Llama-3-70b-FP16-Q2_K | 565/798 | 70.80 |
| Llama-3-70b-FP16-Q4_K_M | 597/798 | 74.81 |
| Llama-3-70b-FP16-Q5_K_M | 611/798 | 76.57 |
| Llama-3-70b-FP16-Q6_K | 605/798 | 75.81 |
| Llama-3-70b-FP16-Q8_0 | 605/798 | 75.81 |

Biology

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 427/717 | 59.55 |
| Open-Hermes-2.5-7b | 417/717 | 58.16 |
| Mistral-7b-Inst-v0.3-q8 | 390/717 | 54.39 |
| Llama-3-8b-q4_K_M | 412/717 | 57.46 |
| Llama-3-8b-q8 | 424/717 | 59.14 |
| Llama-3-8b-SPPO-Iter-3 | 316/717 | 44.07 |
| Hermes-2-Theta-Llama-3-8b | 453/717 | 63.18 |
| Yi-1.5-9b-32k-q8 | 288/717 | 40.17 |
| Phi-Medium-128k-q8 | 262/717 | 36.54 |
| Mixtral-8x7b-Instruct-Q8 | 334/717 | 46.58 |
| Dolphin-Mixtral-2.5-8x7b | 434/717 | 60.53 |
| Nous-Capybara-34b | 473/717 | 65.97 |
| Yi-1.5-34B-32K-Q8 | 521/717 | 72.66 |
| Command-R-v01-Q8 | 138/717 | 19.25 |
| Llama-3-70b-FP16-Q2_KXXS | 510/717 | 71.13 |
| Llama-3-70b-FP16-Q2_K | 556/717 | 77.55 |
| Llama-3-70b-FP16-Q4_K_M | 581/717 | 81.03 |
| Llama-3-70b-FP16-Q5_K_M | 579/717 | 80.75 |
| Llama-3-70b-FP16-Q6_K | 574/717 | 80.06 |
| Llama-3-70b-FP16-Q8_0 | 581/717 | 81.03 |

Chemistry

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 246/1132 | 21.73 |
| Open-Hermes-2.5-7b | 298/1132 | 26.33 |
| Mistral-7b-Inst-v0.3-q8 | 265/1132 | 23.41 |
| Llama-3-8b-q4_K_M | 163/1132 | 14.40 |
| Llama-3-8b-q8 | 175/1132 | 15.46 |
| Llama-3-8b-SPPO-Iter-3 | 236/1132 | 20.85 |
| Hermes-2-Theta-Llama-3-8b | 330/1132 | 29.15 |
| Yi-1.5-9b-32k-q8 | 270/1132 | 23.85 |
| Phi-Medium-128k-q8 | 207/1132 | 18.29 |
| Mixtral-8x7b-Instruct-Q8 | 338/1132 | 29.86 |
| Dolphin-Mixtral-2.5-8x7b | 369/1132 | 32.60 |
| Nous-Capybara-34b | 368/1132 | 32.51 |
| Yi-1.5-34B-32K-Q8 | 350/1132 | 30.92 |
| Command-R-v01-Q8 | 129/1132 | 11.40 |
| Llama-3-70b-FP16-Q2_KXXS | 331/1132 | 29.24 |
| Llama-3-70b-FP16-Q2_K | 378/1132 | 33.39 |
| Llama-3-70b-FP16-Q4_K_M | 475/1132 | 41.96 |
| Llama-3-70b-FP16-Q5_K_M | 493/1132 | 43.55 |
| Llama-3-70b-FP16-Q6_K | 461/1132 | 40.72 |
| Llama-3-70b-FP16-Q8_0 | 502/1132 | 44.35 |

History

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 143/381 | 37.53 |
| Open-Hermes-2.5-7b | 148/381 | 38.85 |
| Mistral-7b-Inst-v0.3-q8 | 120/381 | 31.50 |
| Llama-3-8b-q4_K_M | 82/381 | 21.52 |
| Llama-3-8b-q8 | 94/381 | 24.67 |
| Llama-3-8b-SPPO-Iter-3 | 70/381 | 18.37 |
| Hermes-2-Theta-Llama-3-8b | 155/381 | 40.68 |
| Yi-1.5-9b-32k-q8 | 69/381 | 18.11 |
| Phi-Medium-128k-q8 | 119/381 | 31.23 |
| Mixtral-8x7b-Instruct-Q8 | 116/381 | 30.45 |
| Dolphin-Mixtral-2.5-8x7b | 155/381 | 40.68 |
| Nous-Capybara-34b | 105/381 | 27.56 |
| Yi-1.5-34B-32K-Q8 | 174/381 | 45.67 |
| Command-R-v01-Q8 | 40/381 | 10.50 |
| Llama-3-70b-FP16-Q2_KXXS | 174/381 | 45.67 |
| Llama-3-70b-FP16-Q2_K | 213/381 | 55.91 |
| Llama-3-70b-FP16-Q4_K_M | 232/381 | 60.89 |
| Llama-3-70b-FP16-Q5_K_M | 231/381 | 60.63 |
| Llama-3-70b-FP16-Q6_K | 231/381 | 60.63 |
| Llama-3-70b-FP16-Q8_0 | 231/381 | 60.63 |

Other

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 375/924 | 40.58 |
| Open-Hermes-2.5-7b | 392/924 | 42.42 |
| Mistral-7b-Inst-v0.3-q8 | 327/924 | 35.39 |
| Llama-3-8b-q4_K_M | 269/924 | 29.11 |
| Llama-3-8b-q8 | 292/924 | 31.60 |
| Llama-3-8b-SPPO-Iter-3 | 270/924 | 29.22 |
| Hermes-2-Theta-Llama-3-8b | 429/924 | 46.43 |
| Yi-1.5-9b-32k-q8 | 227/924 | 24.57 |
| Phi-Medium-128k-q8 | 388/924 | 41.99 |
| Mixtral-8x7b-Instruct-Q8 | 355/924 | 38.42 |
| Dolphin-Mixtral-2.5-8x7b | 448/924 | 48.48 |
| Nous-Capybara-34b | 451/924 | 48.81 |
| Yi-1.5-34B-32K-Q8 | 481/924 | 52.06 |
| Command-R-v01-Q8 | 131/924 | 14.18 |
| Llama-3-70b-FP16-Q2_KXXS | 395/924 | 42.75 |
| Llama-3-70b-FP16-Q2_K | 472/924 | 51.08 |
| Llama-3-70b-FP16-Q4_K_M | 529/924 | 57.25 |
| Llama-3-70b-FP16-Q5_K_M | 552/924 | 59.74 |
| Llama-3-70b-FP16-Q6_K | 546/924 | 59.09 |
| Llama-3-70b-FP16-Q8_0 | 556/924 | 60.17 |

Health

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 376/818 | 45.97 |
| Open-Hermes-2.5-7b | 356/818 | 43.52 |
| Mistral-7b-Inst-v0.3-q8 | 294/818 | 35.94 |
| Llama-3-8b-q4_K_M | 216/818 | 26.41 |
| Llama-3-8b-q8 | 263/818 | 32.15 |
| Llama-3-8b-SPPO-Iter-3 | 229/818 | 28.00 |
| Hermes-2-Theta-Llama-3-8b | 388/818 | 47.43 |
| Yi-1.5-9b-32k-q8 | 227/818 | 27.75 |
| Phi-Medium-128k-q8 | 349/818 | 42.67 |
| Mixtral-8x7b-Instruct-Q8 | 325/818 | 39.73 |
| Dolphin-Mixtral-2.5-8x7b | 367/818 | 44.87 |
| Nous-Capybara-34b | 348/818 | 42.54 |
| Yi-1.5-34B-32K-Q8 | 468/818 | 57.21 |
| Command-R-v01-Q8 | 110/818 | 13.45 |
| Llama-3-70b-FP16-Q2_KXXS | 406/818 | 49.63 |
| Llama-3-70b-FP16-Q2_K | 502/818 | 61.37 |
| Llama-3-70b-FP16-Q4_K_M | 542/818 | 66.26 |
| Llama-3-70b-FP16-Q5_K_M | 551/818 | 67.36 |
| Llama-3-70b-FP16-Q6_K | 546/818 | 66.75 |
| Llama-3-70b-FP16-Q8_0 | 544/818 | 66.50 |

Economics

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 391/844 | 46.33 |
| Open-Hermes-2.5-7b | 407/844 | 48.22 |
| Mistral-7b-Inst-v0.3-q8 | 343/844 | 40.64 |
| Llama-3-8b-q4_K_M | 307/844 | 36.37 |
| Llama-3-8b-q8 | 309/844 | 36.61 |
| Llama-3-8b-SPPO-Iter-3 | 249/844 | 29.50 |
| Hermes-2-Theta-Llama-3-8b | 448/844 | 53.08 |
| Yi-1.5-9b-32k-q8 | 290/844 | 34.36 |
| Phi-Medium-128k-q8 | 369/844 | 43.72 |
| Mixtral-8x7b-Instruct-Q8 | 415/844 | 49.17 |
| Dolphin-Mixtral-2.5-8x7b | 462/844 | 54.74 |
| Nous-Capybara-34b | 451/844 | 53.44 |
| Yi-1.5-34B-32K-Q8 | 519/844 | 61.49 |
| Command-R-v01-Q8 | 198/844 | 23.46 |
| Llama-3-70b-FP16-Q2_KXXS | 494/844 | 58.53 |
| Llama-3-70b-FP16-Q2_K | 565/844 | 66.94 |
| Llama-3-70b-FP16-Q4_K_M | 606/844 | 71.80 |
| Llama-3-70b-FP16-Q5_K_M | 623/844 | 73.82 |
| Llama-3-70b-FP16-Q6_K | 614/844 | 72.75 |
| Llama-3-70b-FP16-Q8_0 | 625/844 | 74.05 |

Math

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 379/1351 | 28.05 |
| Open-Hermes-2.5-7b | 423/1351 | 31.31 |
| Mistral-7b-Inst-v0.3-q8 | 399/1351 | 29.53 |
| Llama-3-8b-q4_K_M | 202/1351 | 14.95 |
| Llama-3-8b-q8 | 167/1351 | 12.36 |
| Llama-3-8b-SPPO-Iter-3 | 392/1351 | 29.02 |
| Hermes-2-Theta-Llama-3-8b | 509/1351 | 37.68 |
| Yi-1.5-9b-32k-q8 | 370/1351 | 27.39 |
| Phi-Medium-128k-q8 | 299/1351 | 22.13 |
| Mixtral-8x7b-Instruct-Q8 | 475/1351 | 35.16 |
| Dolphin-Mixtral-2.5-8x7b | 487/1351 | 36.04 |
| Nous-Capybara-34b | 347/1351 | 25.68 |
| Yi-1.5-34B-32K-Q8 | 467/1351 | 34.57 |
| Command-R-v01-Q8 | 166/1351 | 12.29 |
| Llama-3-70b-FP16-Q2_KXXS | 336/1351 | 24.87 |
| Llama-3-70b-FP16-Q2_K | 436/1351 | 32.27 |
| Llama-3-70b-FP16-Q4_K_M | 529/1351 | 39.16 |
| Llama-3-70b-FP16-Q5_K_M | 543/1351 | 40.19 |
| Llama-3-70b-FP16-Q6_K | 547/1351 | 40.49 |
| Llama-3-70b-FP16-Q8_0 | 532/1351 | 39.38 |

Physics

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 344/1299 | 26.48 |
| Open-Hermes-2.5-7b | 351/1299 | 27.02 |
| Mistral-7b-Inst-v0.3-q8 | 338/1299 | 26.02 |
| Llama-3-8b-q4_K_M | 168/1299 | 12.93 |
| Llama-3-8b-q8 | 178/1299 | 13.70 |
| Llama-3-8b-SPPO-Iter-3 | 312/1299 | 24.02 |
| Hermes-2-Theta-Llama-3-8b | 417/1299 | 32.10 |
| Yi-1.5-9b-32k-q8 | 321/1299 | 24.71 |
| Phi-Medium-128k-q8 | 312/1299 | 24.02 |
| Mixtral-8x7b-Instruct-Q8 | 442/1299 | 34.03 |
| Dolphin-Mixtral-2.5-8x7b | 410/1299 | 31.56 |
| Nous-Capybara-34b | 404/1299 | 31.10 |
| Yi-1.5-34B-32K-Q8 | 483/1299 | 37.18 |
| Command-R-v01-Q8 | 166/1299 | 12.78 |
| Llama-3-70b-FP16-Q2_KXXS | 382/1299 | 29.41 |
| Llama-3-70b-FP16-Q2_K | 478/1299 | 36.80 |
| Llama-3-70b-FP16-Q4_K_M | 541/1299 | 41.65 |
| Llama-3-70b-FP16-Q5_K_M | 565/1299 | 43.49 |
| Llama-3-70b-FP16-Q6_K | 550/1299 | 42.34 |
| Llama-3-70b-FP16-Q8_0 | 544/1299 | 41.88 |

Computer Science

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 137/410 | 33.41 |
| Open-Hermes-2.5-7b | 166/410 | 40.49 |
| Mistral-7b-Inst-v0.3-q8 | 120/410 | 29.27 |
| Llama-3-8b-q4_K_M | 105/410 | 25.61 |
| Llama-3-8b-q8 | 125/410 | 30.49 |
| Llama-3-8b-SPPO-Iter-3 | 130/410 | 31.71 |
| Hermes-2-Theta-Llama-3-8b | 169/410 | 41.22 |
| Yi-1.5-9b-32k-q8 | 96/410 | 23.41 |
| Phi-Medium-128k-q8 | 131/410 | 31.95 |
| Mixtral-8x7b-Instruct-Q8 | 150/410 | 36.59 |
| Dolphin-Mixtral-2.5-8x7b | 177/410 | 43.17 |
| Nous-Capybara-34b | 134/410 | 32.68 |
| Yi-1.5-34B-32K-Q8 | 191/410 | 46.59 |
| Command-R-v01-Q8 | 61/410 | 14.88 |
| Llama-3-70b-FP16-Q2_KXXS | 186/410 | 45.37 |
| Llama-3-70b-FP16-Q2_K | 199/410 | 48.54 |
| Llama-3-70b-FP16-Q4_K_M | 239/410 | 58.29 |
| Llama-3-70b-FP16-Q5_K_M | 241/410 | 58.78 |
| Llama-3-70b-FP16-Q6_K | 240/410 | 58.54 |
| Llama-3-70b-FP16-Q8_0 | 238/410 | 58.05 |

Philosophy

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 170/499 | 34.07 |
| Open-Hermes-2.5-7b | 200/499 | 40.08 |
| Mistral-7b-Inst-v0.3-q8 | 175/499 | 35.07 |
| Llama-3-8b-q4_K_M | 152/499 | 30.46 |
| Llama-3-8b-q8 | 161/499 | 32.26 |
| Llama-3-8b-SPPO-Iter-3 | 142/499 | 28.46 |
| Hermes-2-Theta-Llama-3-8b | 194/499 | 38.88 |
| Yi-1.5-9b-32k-q8 | 114/499 | 22.85 |
| Phi-Medium-128k-q8 | 187/499 | 37.47 |
| Mixtral-8x7b-Instruct-Q8 | 194/499 | 38.88 |
| Dolphin-Mixtral-2.5-8x7b | 212/499 | 42.48 |
| Nous-Capybara-34b | 197/499 | 39.48 |
| Yi-1.5-34B-32K-Q8 | 257/499 | 51.50 |
| Command-R-v01-Q8 | 160/499 | 32.06 |
| Llama-3-70b-FP16-Q2_KXXS | 200/499 | 40.08 |
| Llama-3-70b-FP16-Q2_K | 258/499 | 51.70 |
| Llama-3-70b-FP16-Q4_K_M | 282/499 | 56.51 |
| Llama-3-70b-FP16-Q5_K_M | 281/499 | 56.31 |
| Llama-3-70b-FP16-Q6_K | 283/499 | 56.71 |
| Llama-3-70b-FP16-Q8_0 | 278/499 | 55.71 |

Engineering

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 196/969 | 20.23 |
| Open-Hermes-2.5-7b | 193/969 | 19.92 |
| Mistral-7b-Inst-v0.3-q8 | 198/969 | 20.43 |
| Llama-3-8b-q4_K_M | 149/969 | 15.38 |
| Llama-3-8b-q8 | 166/969 | 17.13 |
| Llama-3-8b-SPPO-Iter-3 | 165/969 | 17.03 |
| Hermes-2-Theta-Llama-3-8b | 245/969 | 25.28 |
| Yi-1.5-9b-32k-q8 | 190/969 | 19.61 |
| Phi-Medium-128k-q8 | 183/969 | 18.89 |
| Mixtral-8x7b-Instruct-Q8 | 234/969 | 24.15 |
| Dolphin-Mixtral-2.5-8x7b | 236/969 | 24.35 |
| Nous-Capybara-34b | 393/969 | 40.56 |
| Yi-1.5-34B-32K-Q8 | 408/969 | 42.11 |
| Command-R-v01-Q8 | 145/969 | 14.96 |
| Llama-3-70b-FP16-Q2_KXXS | 326/969 | 33.64 |
| Llama-3-70b-FP16-Q2_K | 375/969 | 38.70 |
| Llama-3-70b-FP16-Q4_K_M | 394/969 | 40.66 |
| Llama-3-70b-FP16-Q5_K_M | 417/969 | 43.03 |
| Llama-3-70b-FP16-Q6_K | 406/969 | 41.90 |
| Llama-3-70b-FP16-Q8_0 | 398/969 | 41.07 |

Totals

| Model | Total Correct | Total Score (%) |
|---|---|---|
| WizardLM-2-7b | 4173/12032 | 34.68 |
| Open-Hermes-2.5-7b | 4330/12032 | 35.99 |
| Mistral-7b-Inst-v0.3-q8 | 3825/12032 | 31.79 |
| Llama-3-8b-q4_K_M | 2862/12032 | 23.79 |
| Llama-3-8b-q8 | 3058/12032 | 25.42 |
| Llama-3-8b-SPPO-Iter-3 | 3210/12032 | 26.68 |
| Hermes-2-Theta-Llama-3-8b | 4799/12032 | 39.89 |
| Yi-1.5-9b-32k-q8 | 3066/12032 | 25.48 |
| Phi-Medium-128k-q8 | 3679/12032 | 30.58 |
| Mixtral-8x7b-Instruct-Q8 | 4335/12032 | 36.03 |
| Dolphin-Mixtral-2.5-8x7b | 4846/12032 | 40.27 |
| Nous-Capybara-34b | 4827/12032 | 40.12 |
| Yi-1.5-34B-32K-Q8 | 5571/12032 | 46.30 |
| Command-R-v01-Q8 | 1847/12032 | 15.35 |
| Llama-3-70b-FP16-Q2_KXXS | 4849/12032 | 40.30 |
| Llama-3-70b-FP16-Q2_K | 5722/12032 | 47.56 |
| Llama-3-70b-FP16-Q4_K_M | 6445/12032 | 53.57 |
| Llama-3-70b-FP16-Q5_K_M | 6571/12032 | 54.61 |
| Llama-3-70b-FP16-Q6_K | 6480/12032 | 53.86 |
| Llama-3-70b-FP16-Q8_0 | 6509/12032 | 54.10 |