MMLU-Pro Combined Results - Model Quantization Comparison

This post combines some new results, my older results, and reddit.com/u/invectorgator's results (shared with permission) to give a clearer picture of all the testing so far. Links to the relevant posts can be found below.

This was a lot of fun, and has lit a fire under me about benchmarking. I have some ideas for a personal benchmarking tool using Wilmer that will be easier for me to run. Will share more info once I dig into it.

As usual, a few notes about the tests:

  • These tests were performed using u/chibop1's MMLU-Pro project. Be sure to swing by and thank them for giving us this fun toy.
  • With u/invectorgator's permission, this post combines all of our results together.
    • We both used the same commits of the MMLU-Pro project, only q8 GGUFs (unless otherwise specified), and Text-Generation-WebUI as our backend to guarantee correct prompt templating, so our test results are compatible (a rough sketch of this setup follows these notes).
  • I didn't run these tests expecting them to be a rigorously scientific assessment of an LLM's knowledge, and I understand the concerns people have about them. But they do test a combination of knowledge AND instruction following; they aren't perfect, but they're better than perplexity testing alone.
  • Invectorgator is covering Gemma, so I'm not.
  • Qwen 2 7b just really does not like this test, at least when running in Text-Generation-WebUI.
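For anyone who wants to picture what the harness is doing under the hood, here is a minimal sketch of the setup described above, not the actual MMLU-Pro project code. It assumes Text-Generation-WebUI was launched with --api so its OpenAI-compatible endpoint is listening on the default port 5000, and the question, options, and prompt wording are made up for illustration; the real harness handles few-shot prompting and answer extraction with more care.

```python
import re
import requests

# Assumption: Text-Generation-WebUI was started with --api, which exposes an
# OpenAI-compatible endpoint at this default address/port.
API_URL = "http://127.0.0.1:5000/v1/chat/completions"

# Hypothetical MMLU-Pro-style question (MMLU-Pro questions have up to 10 options, A-J).
question = "Which of the following numbers is prime?"
options = ["21", "33", "37", "49", "51", "57", "63", "77", "81", "91"]

option_block = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCDEFGHIJ", options))
prompt = (
    "Answer the following multiple-choice question. Think step by step, "
    'then finish with "The answer is (X)".\n\n'
    f"{question}\n{option_block}"
)

resp = requests.post(
    API_URL,
    json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.0,
    },
    timeout=300,
)
reply = resp.json()["choices"][0]["message"]["content"]

# Pull the final answer letter out of the response. Replies that never produce
# a parseable letter get scored as wrong, which is why these results measure
# instruction following as much as raw knowledge.
match = re.search(r"answer is \(?([A-J])\)?", reply, re.IGNORECASE)
print(match.group(1).upper() if match else "unparseable")
```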

New Models In This Test

This test adds the following new models to the pile; I went with some of my personal favorite fine-tunes. You can find the exact GGUFs I used below, and the linked posts list the exact GGUFs for the other models:

Old Posts Combined Into This One:

Key Takeaway

I am now convinced that Hermes 2 Theta Llama 3 8b is secretly a 30b in disguise. To say it is punching above its weight is an understatement.

All tests below were run on GGUFs (q8 unless otherwise noted) in Text-Generation-WebUI. The tests require more than 4096 tokens of context, so some model versions were chosen to meet that requirement.

Within each table, rows are loosely grouped by model size and family.
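For clarity on how to read the numbers: the Correct column is questions answered correctly out of the questions in that category, Score (%) is that fraction times 100, and the Totals table at the bottom sums the raw counts across all 14 categories (12032 questions) rather than averaging the per-category percentages. A quick sketch of that arithmetic, using rows that appear in the tables below (Hermes-2-Theta's Business and Law results):

```python
# Score (%) is simply correct / total * 100, rounded to two decimal places.
def score(correct: int, total: int) -> float:
    return round(correct / total * 100, 2)

# Single-category example (Hermes-2-Theta-Llama-3-8b, Business table below).
print(score(330, 789))    # 41.83

# The Totals table sums raw counts across categories instead of averaging
# the per-category percentages. Two of Hermes-2-Theta's rows as an example:
rows = [(330, 789), (280, 1101)]                 # (correct, total) for Business, Law
total_correct = sum(c for c, _ in rows)          # 610
total_questions = sum(t for _, t in rows)        # 1890
print(score(total_correct, total_questions))     # 32.28
```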

Business

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 277/789 | 35.11 |
| Open-Hermes-2.5-7b | 285/789 | 36.12 |
| Mistral-7b-Inst-v0.3-q8 | 265/789 | 33.59 |
| Llama-3-8b-q4_K_M | 148/789 | 18.76 |
| Llama-3-8b-q8 | 160/789 | 20.28 |
| Llama-3-8b-SPPO-Iter-3 | 247/789 | 31.31 |
| Hermes-2-Theta-Llama-3-8b | 330/789 | 41.83 |
| Yi-1.5-9b-32k-q8 | 240/789 | 30.42 |
| Phi-Medium-128k-q8 | 260/789 | 32.95 |
| Mixtral-8x7b-Instruct-Q8 | 310/789 | 39.29 |
| Dolphin-Mixtral-2.5-8x7b | 350/789 | 44.36 |
| Nous-Capybara-34b | 313/789 | 39.67 |
| Yi-1.5-34B-32K-Q8 | 325/789 | 41.19 |
| Command-R-v01-Q8 | 126/789 | 15.97 |
| Llama-3-70b-FP16-Q2_KXXS | 254/789 | 32.19 |
| Llama-3-70b-FP16-Q2_K | 309/789 | 39.16 |
| Llama-3-70b-FP16-Q4_K_M | 427/789 | 54.12 |
| Llama-3-70b-FP16-Q5_K_M | 415/789 | 52.60 |
| Llama-3-70b-FP16-Q6_K | 408/789 | 51.71 |
| Llama-3-70b-FP16-Q8_0 | 411/789 | 52.09 |

Law

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 282/1101 | 25.61 |
| Open-Hermes-2.5-7b | 260/1101 | 23.61 |
| Mistral-7b-Inst-v0.3-q8 | 248/1101 | 22.52 |
| Yi-1.5-9b-32k-q8 | 191/1101 | 17.35 |
| Phi-Medium-128k-q8 | 255/1101 | 23.16 |
| Llama-3-8b-q4_K_M | 161/1101 | 14.62 |
| Llama-3-8b-q8 | 172/1101 | 15.62 |
| Llama-3-8b-SPPO-Iter-3 | 200/1101 | 18.17 |
| Hermes-2-Theta-Llama-3-8b | 280/1101 | 25.43 |
| Mixtral-8x7b-Instruct-Q8 | 282/1101 | 25.61 |
| Dolphin-Mixtral-2.5-8x7b | 271/1101 | 24.61 |
| Nous-Capybara-34b | 369/1101 | 33.51 |
| Yi-1.5-34B-32K-Q8 | 417/1101 | 37.87 |
| Command-R-v01-Q8 | 146/1101 | 13.26 |
| Llama-3-70b-FP16-Q2_KXXS | 362/1101 | 32.88 |
| Llama-3-70b-FP16-Q2_K | 416/1101 | 37.78 |
| Llama-3-70b-FP16-Q4_K_M | 471/1101 | 42.78 |
| Llama-3-70b-FP16-Q5_K_M | 469/1101 | 42.60 |
| Llama-3-70b-FP16-Q6_K | 469/1101 | 42.60 |
| Llama-3-70b-FP16-Q8_0 | 464/1101 | 42.14 |

Psychology

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 430/798 | 53.88 |
| Open-Hermes-2.5-7b | 434/798 | 54.39 |
| Mistral-7b-Inst-v0.3-q8 | 343/798 | 42.98 |
| Llama-3-8b-q4_K_M | 328/798 | 41.10 |
| Llama-3-8b-q8 | 372/798 | 46.62 |
| Llama-3-8b-SPPO-Iter-3 | 252/798 | 31.58 |
| Hermes-2-Theta-Llama-3-8b | 452/798 | 56.64 |
| Yi-1.5-9b-32k-q8 | 173/798 | 21.68 |
| Phi-Medium-128k-q8 | 358/798 | 44.86 |
| Mixtral-8x7b-Instruct-Q8 | 365/798 | 45.74 |
| Dolphin-Mixtral-2.5-8x7b | 468/798 | 58.65 |
| Nous-Capybara-34b | 474/798 | 59.40 |
| Yi-1.5-34B-32K-Q8 | 510/798 | 63.91 |
| Command-R-v01-Q8 | 131/798 | 16.42 |
| Llama-3-70b-FP16-Q2_KXXS | 493/798 | 61.78 |
| Llama-3-70b-FP16-Q2_K | 565/798 | 70.80 |
| Llama-3-70b-FP16-Q4_K_M | 597/798 | 74.81 |
| Llama-3-70b-FP16-Q5_K_M | 611/798 | 76.57 |
| Llama-3-70b-FP16-Q6_K | 605/798 | 75.81 |
| Llama-3-70b-FP16-Q8_0 | 605/798 | 75.81 |

Biology

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 427/717 | 59.55 |
| Open-Hermes-2.5-7b | 417/717 | 58.16 |
| Mistral-7b-Inst-v0.3-q8 | 390/717 | 54.39 |
| Llama-3-8b-q4_K_M | 412/717 | 57.46 |
| Llama-3-8b-q8 | 424/717 | 59.14 |
| Llama-3-8b-SPPO-Iter-3 | 316/717 | 44.07 |
| Hermes-2-Theta-Llama-3-8b | 453/717 | 63.18 |
| Yi-1.5-9b-32k-q8 | 288/717 | 40.17 |
| Phi-Medium-128k-q8 | 262/717 | 36.54 |
| Mixtral-8x7b-Instruct-Q8 | 334/717 | 46.58 |
| Dolphin-Mixtral-2.5-8x7b | 434/717 | 60.53 |
| Nous-Capybara-34b | 473/717 | 65.97 |
| Yi-1.5-34B-32K-Q8 | 521/717 | 72.66 |
| Command-R-v01-Q8 | 138/717 | 19.25 |
| Llama-3-70b-FP16-Q2_KXXS | 510/717 | 71.13 |
| Llama-3-70b-FP16-Q2_K | 556/717 | 77.55 |
| Llama-3-70b-FP16-Q4_K_M | 581/717 | 81.03 |
| Llama-3-70b-FP16-Q5_K_M | 579/717 | 80.75 |
| Llama-3-70b-FP16-Q6_K | 574/717 | 80.06 |
| Llama-3-70b-FP16-Q8_0 | 581/717 | 81.03 |

Chemistry

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 246/1132 | 21.73 |
| Open-Hermes-2.5-7b | 298/1132 | 26.33 |
| Mistral-7b-Inst-v0.3-q8 | 265/1132 | 23.41 |
| Llama-3-8b-q4_K_M | 163/1132 | 14.40 |
| Llama-3-8b-q8 | 175/1132 | 15.46 |
| Llama-3-8b-SPPO-Iter-3 | 236/1132 | 20.85 |
| Hermes-2-Theta-Llama-3-8b | 330/1132 | 29.15 |
| Yi-1.5-9b-32k-q8 | 270/1132 | 23.85 |
| Phi-Medium-128k-q8 | 207/1132 | 18.29 |
| Mixtral-8x7b-Instruct-Q8 | 338/1132 | 29.86 |
| Dolphin-Mixtral-2.5-8x7b | 369/1132 | 32.60 |
| Nous-Capybara-34b | 368/1132 | 32.51 |
| Yi-1.5-34B-32K-Q8 | 350/1132 | 30.92 |
| Command-R-v01-Q8 | 129/1132 | 11.40 |
| Llama-3-70b-FP16-Q2_KXXS | 331/1132 | 29.24 |
| Llama-3-70b-FP16-Q2_K | 378/1132 | 33.39 |
| Llama-3-70b-FP16-Q4_K_M | 475/1132 | 41.96 |
| Llama-3-70b-FP16-Q5_K_M | 493/1132 | 43.55 |
| Llama-3-70b-FP16-Q6_K | 461/1132 | 40.72 |
| Llama-3-70b-FP16-Q8_0 | 502/1132 | 44.35 |

History

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 143/381 | 37.53 |
| Open-Hermes-2.5-7b | 148/381 | 38.85 |
| Mistral-7b-Inst-v0.3-q8 | 120/381 | 31.50 |
| Llama-3-8b-q4_K_M | 82/381 | 21.52 |
| Llama-3-8b-q8 | 94/381 | 24.67 |
| Llama-3-8b-SPPO-Iter-3 | 70/381 | 18.37 |
| Hermes-2-Theta-Llama-3-8b | 155/381 | 40.68 |
| Yi-1.5-9b-32k-q8 | 69/381 | 18.11 |
| Phi-Medium-128k-q8 | 119/381 | 31.23 |
| Mixtral-8x7b-Instruct-Q8 | 116/381 | 30.45 |
| Dolphin-Mixtral-2.5-8x7b | 155/381 | 40.68 |
| Nous-Capybara-34b | 105/381 | 27.56 |
| Yi-1.5-34B-32K-Q8 | 174/381 | 45.67 |
| Command-R-v01-Q8 | 40/381 | 10.50 |
| Llama-3-70b-FP16-Q2_KXXS | 174/381 | 45.67 |
| Llama-3-70b-FP16-Q2_K | 213/381 | 55.91 |
| Llama-3-70b-FP16-Q4_K_M | 232/381 | 60.89 |
| Llama-3-70b-FP16-Q5_K_M | 231/381 | 60.63 |
| Llama-3-70b-FP16-Q6_K | 231/381 | 60.63 |
| Llama-3-70b-FP16-Q8_0 | 231/381 | 60.63 |

Other

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 375/924 | 40.58 |
| Open-Hermes-2.5-7b | 392/924 | 42.42 |
| Mistral-7b-Inst-v0.3-q8 | 327/924 | 35.39 |
| Llama-3-8b-q4_K_M | 269/924 | 29.11 |
| Llama-3-8b-q8 | 292/924 | 31.60 |
| Llama-3-8b-SPPO-Iter-3 | 270/924 | 29.22 |
| Hermes-2-Theta-Llama-3-8b | 429/924 | 46.43 |
| Yi-1.5-9b-32k-q8 | 227/924 | 24.57 |
| Phi-Medium-128k-q8 | 388/924 | 41.99 |
| Mixtral-8x7b-Instruct-Q8 | 355/924 | 38.42 |
| Dolphin-Mixtral-2.5-8x7b | 448/924 | 48.48 |
| Nous-Capybara-34b | 451/924 | 48.81 |
| Yi-1.5-34B-32K-Q8 | 481/924 | 52.06 |
| Command-R-v01-Q8 | 131/924 | 14.18 |
| Llama-3-70b-FP16-Q2_KXXS | 395/924 | 42.75 |
| Llama-3-70b-FP16-Q2_K | 472/924 | 51.08 |
| Llama-3-70b-FP16-Q4_K_M | 529/924 | 57.25 |
| Llama-3-70b-FP16-Q5_K_M | 552/924 | 59.74 |
| Llama-3-70b-FP16-Q6_K | 546/924 | 59.09 |
| Llama-3-70b-FP16-Q8_0 | 556/924 | 60.17 |

Health

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 376/818 | 45.97 |
| Open-Hermes-2.5-7b | 356/818 | 43.52 |
| Mistral-7b-Inst-v0.3-q8 | 294/818 | 35.94 |
| Llama-3-8b-q4_K_M | 216/818 | 26.41 |
| Llama-3-8b-q8 | 263/818 | 32.15 |
| Llama-3-8b-SPPO-Iter-3 | 229/818 | 28.00 |
| Hermes-2-Theta-Llama-3-8b | 388/818 | 47.43 |
| Yi-1.5-9b-32k-q8 | 227/818 | 27.75 |
| Phi-Medium-128k-q8 | 349/818 | 42.67 |
| Mixtral-8x7b-Instruct-Q8 | 325/818 | 39.73 |
| Dolphin-Mixtral-2.5-8x7b | 367/818 | 44.87 |
| Nous-Capybara-34b | 348/818 | 42.54 |
| Yi-1.5-34B-32K-Q8 | 468/818 | 57.21 |
| Command-R-v01-Q8 | 110/818 | 13.45 |
| Llama-3-70b-FP16-Q2_KXXS | 406/818 | 49.63 |
| Llama-3-70b-FP16-Q2_K | 502/818 | 61.37 |
| Llama-3-70b-FP16-Q4_K_M | 542/818 | 66.26 |
| Llama-3-70b-FP16-Q5_K_M | 551/818 | 67.36 |
| Llama-3-70b-FP16-Q6_K | 546/818 | 66.75 |
| Llama-3-70b-FP16-Q8_0 | 544/818 | 66.50 |

Economics

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 391/844 | 46.33 |
| Open-Hermes-2.5-7b | 407/844 | 48.22 |
| Mistral-7b-Inst-v0.3-q8 | 343/844 | 40.64 |
| Llama-3-8b-q4_K_M | 307/844 | 36.37 |
| Llama-3-8b-q8 | 309/844 | 36.61 |
| Llama-3-8b-SPPO-Iter-3 | 249/844 | 29.50 |
| Hermes-2-Theta-Llama-3-8b | 448/844 | 53.08 |
| Yi-1.5-9b-32k-q8 | 290/844 | 34.36 |
| Phi-Medium-128k-q8 | 369/844 | 43.72 |
| Mixtral-8x7b-Instruct-Q8 | 415/844 | 49.17 |
| Dolphin-Mixtral-2.5-8x7b | 462/844 | 54.74 |
| Nous-Capybara-34b | 451/844 | 53.44 |
| Yi-1.5-34B-32K-Q8 | 519/844 | 61.49 |
| Command-R-v01-Q8 | 198/844 | 23.46 |
| Llama-3-70b-FP16-Q2_KXXS | 494/844 | 58.53 |
| Llama-3-70b-FP16-Q2_K | 565/844 | 66.94 |
| Llama-3-70b-FP16-Q4_K_M | 606/844 | 71.80 |
| Llama-3-70b-FP16-Q5_K_M | 623/844 | 73.82 |
| Llama-3-70b-FP16-Q6_K | 614/844 | 72.75 |
| Llama-3-70b-FP16-Q8_0 | 625/844 | 74.05 |

Math

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 379/1351 | 28.05 |
| Open-Hermes-2.5-7b | 423/1351 | 31.31 |
| Mistral-7b-Inst-v0.3-q8 | 399/1351 | 29.53 |
| Llama-3-8b-q4_K_M | 202/1351 | 14.95 |
| Llama-3-8b-q8 | 167/1351 | 12.36 |
| Llama-3-8b-SPPO-Iter-3 | 392/1351 | 29.02 |
| Hermes-2-Theta-Llama-3-8b | 509/1351 | 37.68 |
| Yi-1.5-9b-32k-q8 | 370/1351 | 27.39 |
| Phi-Medium-128k-q8 | 299/1351 | 22.13 |
| Mixtral-8x7b-Instruct-Q8 | 475/1351 | 35.16 |
| Dolphin-Mixtral-2.5-8x7b | 487/1351 | 36.04 |
| Nous-Capybara-34b | 347/1351 | 25.68 |
| Yi-1.5-34B-32K-Q8 | 467/1351 | 34.57 |
| Command-R-v01-Q8 | 166/1351 | 12.29 |
| Llama-3-70b-FP16-Q2_KXXS | 336/1351 | 24.87 |
| Llama-3-70b-FP16-Q2_K | 436/1351 | 32.27 |
| Llama-3-70b-FP16-Q4_K_M | 529/1351 | 39.16 |
| Llama-3-70b-FP16-Q5_K_M | 543/1351 | 40.19 |
| Llama-3-70b-FP16-Q6_K | 547/1351 | 40.49 |
| Llama-3-70b-FP16-Q8_0 | 532/1351 | 39.38 |

Physics

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 344/1299 | 26.48 |
| Open-Hermes-2.5-7b | 351/1299 | 27.02 |
| Mistral-7b-Inst-v0.3-q8 | 338/1299 | 26.02 |
| Llama-3-8b-q4_K_M | 168/1299 | 12.93 |
| Llama-3-8b-q8 | 178/1299 | 13.70 |
| Llama-3-8b-SPPO-Iter-3 | 312/1299 | 24.02 |
| Hermes-2-Theta-Llama-3-8b | 417/1299 | 32.10 |
| Yi-1.5-9b-32k-q8 | 321/1299 | 24.71 |
| Phi-Medium-128k-q8 | 312/1299 | 24.02 |
| Mixtral-8x7b-Instruct-Q8 | 442/1299 | 34.03 |
| Dolphin-Mixtral-2.5-8x7b | 410/1299 | 31.56 |
| Nous-Capybara-34b | 404/1299 | 31.10 |
| Yi-1.5-34B-32K-Q8 | 483/1299 | 37.18 |
| Command-R-v01-Q8 | 166/1299 | 12.78 |
| Llama-3-70b-FP16-Q2_KXXS | 382/1299 | 29.41 |
| Llama-3-70b-FP16-Q2_K | 478/1299 | 36.80 |
| Llama-3-70b-FP16-Q4_K_M | 541/1299 | 41.65 |
| Llama-3-70b-FP16-Q5_K_M | 565/1299 | 43.49 |
| Llama-3-70b-FP16-Q6_K | 550/1299 | 42.34 |
| Llama-3-70b-FP16-Q8_0 | 544/1299 | 41.88 |

Computer Science

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 137/410 | 33.41 |
| Open-Hermes-2.5-7b | 166/410 | 40.49 |
| Mistral-7b-Inst-v0.3-q8 | 120/410 | 29.27 |
| Llama-3-8b-q4_K_M | 105/410 | 25.61 |
| Llama-3-8b-q8 | 125/410 | 30.49 |
| Llama-3-8b-SPPO-Iter-3 | 130/410 | 31.71 |
| Hermes-2-Theta-Llama-3-8b | 169/410 | 41.22 |
| Yi-1.5-9b-32k-q8 | 96/410 | 23.41 |
| Phi-Medium-128k-q8 | 131/410 | 31.95 |
| Mixtral-8x7b-Instruct-Q8 | 150/410 | 36.59 |
| Dolphin-Mixtral-2.5-8x7b | 177/410 | 43.17 |
| Nous-Capybara-34b | 134/410 | 32.68 |
| Yi-1.5-34B-32K-Q8 | 191/410 | 46.59 |
| Command-R-v01-Q8 | 61/410 | 14.88 |
| Llama-3-70b-FP16-Q2_KXXS | 186/410 | 45.37 |
| Llama-3-70b-FP16-Q2_K | 199/410 | 48.54 |
| Llama-3-70b-FP16-Q4_K_M | 239/410 | 58.29 |
| Llama-3-70b-FP16-Q5_K_M | 241/410 | 58.78 |
| Llama-3-70b-FP16-Q6_K | 240/410 | 58.54 |
| Llama-3-70b-FP16-Q8_0 | 238/410 | 58.05 |

Philosophy

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 170/499 | 34.07 |
| Open-Hermes-2.5-7b | 200/499 | 40.08 |
| Mistral-7b-Inst-v0.3-q8 | 175/499 | 35.07 |
| Llama-3-8b-q4_K_M | 152/499 | 30.46 |
| Llama-3-8b-q8 | 161/499 | 32.26 |
| Llama-3-8b-SPPO-Iter-3 | 142/499 | 28.46 |
| Hermes-2-Theta-Llama-3-8b | 194/499 | 38.88 |
| Yi-1.5-9b-32k-q8 | 114/499 | 22.85 |
| Phi-Medium-128k-q8 | 187/499 | 37.47 |
| Mixtral-8x7b-Instruct-Q8 | 194/499 | 38.88 |
| Dolphin-Mixtral-2.5-8x7b | 212/499 | 42.48 |
| Nous-Capybara-34b | 197/499 | 39.48 |
| Yi-1.5-34B-32K-Q8 | 257/499 | 51.50 |
| Command-R-v01-Q8 | 160/499 | 32.06 |
| Llama-3-70b-FP16-Q2_KXXS | 200/499 | 40.08 |
| Llama-3-70b-FP16-Q2_K | 258/499 | 51.70 |
| Llama-3-70b-FP16-Q4_K_M | 282/499 | 56.51 |
| Llama-3-70b-FP16-Q5_K_M | 281/499 | 56.31 |
| Llama-3-70b-FP16-Q6_K | 283/499 | 56.71 |
| Llama-3-70b-FP16-Q8_0 | 278/499 | 55.71 |

Engineering

| Model | Correct | Score (%) |
|---|---|---|
| WizardLM-2-7b | 196/969 | 20.23 |
| Open-Hermes-2.5-7b | 193/969 | 19.92 |
| Mistral-7b-Inst-v0.3-q8 | 198/969 | 20.43 |
| Llama-3-8b-q4_K_M | 149/969 | 15.38 |
| Llama-3-8b-q8 | 166/969 | 17.13 |
| Llama-3-8b-SPPO-Iter-3 | 165/969 | 17.03 |
| Hermes-2-Theta-Llama-3-8b | 245/969 | 25.28 |
| Yi-1.5-9b-32k-q8 | 190/969 | 19.61 |
| Phi-Medium-128k-q8 | 183/969 | 18.89 |
| Mixtral-8x7b-Instruct-Q8 | 234/969 | 24.15 |
| Dolphin-Mixtral-2.5-8x7b | 236/969 | 24.35 |
| Nous-Capybara-34b | 393/969 | 40.56 |
| Yi-1.5-34B-32K-Q8 | 408/969 | 42.11 |
| Command-R-v01-Q8 | 145/969 | 14.96 |
| Llama-3-70b-FP16-Q2_KXXS | 326/969 | 33.64 |
| Llama-3-70b-FP16-Q2_K | 375/969 | 38.70 |
| Llama-3-70b-FP16-Q4_K_M | 394/969 | 40.66 |
| Llama-3-70b-FP16-Q5_K_M | 417/969 | 43.03 |
| Llama-3-70b-FP16-Q6_K | 406/969 | 41.90 |
| Llama-3-70b-FP16-Q8_0 | 398/969 | 41.07 |

Totals

| Model | Total Correct | Total Score (%) |
|---|---|---|
| WizardLM-2-7b | 4173/12032 | 34.68 |
| Open-Hermes-2.5-7b | 4330/12032 | 35.99 |
| Mistral-7b-Inst-v0.3-q8 | 3825/12032 | 31.79 |
| Llama-3-8b-q4_K_M | 2862/12032 | 23.79 |
| Llama-3-8b-q8 | 3058/12032 | 25.42 |
| Llama-3-8b-SPPO-Iter-3 | 3210/12032 | 26.68 |
| Hermes-2-Theta-Llama-3-8b | 4799/12032 | 39.89 |
| Yi-1.5-9b-32k-q8 | 3066/12032 | 25.48 |
| Phi-Medium-128k-q8 | 3679/12032 | 30.58 |
| Mixtral-8x7b-Instruct-Q8 | 4335/12032 | 36.03 |
| Dolphin-Mixtral-2.5-8x7b | 4846/12032 | 40.27 |
| Nous-Capybara-34b | 4827/12032 | 40.12 |
| Yi-1.5-34B-32K-Q8 | 5571/12032 | 46.30 |
| Command-R-v01-Q8 | 1847/12032 | 15.35 |
| Llama-3-70b-FP16-Q2_KXXS | 4849/12032 | 40.30 |
| Llama-3-70b-FP16-Q2_K | 5722/12032 | 47.56 |
| Llama-3-70b-FP16-Q4_K_M | 6445/12032 | 53.57 |
| Llama-3-70b-FP16-Q5_K_M | 6571/12032 | 54.61 |
| Llama-3-70b-FP16-Q6_K | 6480/12032 | 53.86 |
| Llama-3-70b-FP16-Q8_0 | 6509/12032 | 54.10 |