Performance was evaluated on over 150 benchmark datasets that span a wide range of languages, along with extensive human evaluations.
| Category | Benchmark | Llama 3.1 8B | Llama 3 8B (April) | Llama 3.1 70B |
|---|---|---|---|---|
| General | MMLU (CoT) | 73.0 | | 86.0 |
| | MMLU PRO (5-shot, CoT) | 48.3 | 45.5 | 66.4 |
| | IFEval | 80.4 | 76.8 | 87.5 |
| Code | HumanEval (0-shot) | 72.6 | 60.4 | 80.5 |
| | MBPP EvalPlus (base) (0-shot) | 72.8 | 70.6 | 86.0 |
| Math | GSM8K (8-shot, CoT) | 84.5 | 80.6 | |
| | MATH (0-shot, CoT) | 51.9 | 29.1 | |
| Reasoning | ARC Challenge (0-shot) | 83.4 | 82.4 | |
| | GPQA (0-shot, CoT) | 32.8 | 34.6 | |
| Tool use | API-Bank (0-shot) | 82.6 | 48.3 | |
| | BFCL | 76.1 | 60.3 | |
| | Gorilla Benchmark API Bench | 8.2 | 1.7 | |
| | Nexus (0-shot) | 38.5 | 18.1 | |
| Multilingual | Multilingual MGSM | 68.9 | - | |
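The shot and CoT annotations describe the evaluation protocol: an "8-shot, CoT" setup, for instance, prepends eight worked question/reasoning exemplars to each test question and scores only the final answer. As a rough illustration of what such a harness looks like (this is a minimal sketch, not Meta's evaluation code; the exemplar, the `generate` callable, and all function names here are placeholders):

```python
import re

# Illustrative CoT exemplar -- NOT an actual prompt used in these evaluations.
EXEMPLARS = [
    ("Natalia sold clips to 48 friends in April, and half as many in May. "
     "How many clips did she sell altogether?",
     "In May she sold 48 / 2 = 24 clips. In total she sold 48 + 24 = 72 clips. "
     "The answer is 72."),
    # In an 8-shot setup, seven more (question, reasoning) pairs go here.
]

def build_prompt(question: str) -> str:
    """Assemble an n-shot chain-of-thought prompt from the exemplars."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXEMPLARS)
    return f"{shots}\n\nQ: {question}\nA:"

def extract_answer(completion: str) -> str | None:
    """Take the last number in the completion as the model's final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def score(dataset, generate) -> float:
    """Exact-match accuracy over (question, gold_answer) pairs.

    `generate` is any callable mapping a prompt string to a completion
    string; gold answers are assumed to be normalized numeric strings.
    """
    correct = 0
    for question, gold in dataset:
        completion = generate(build_prompt(question))
        if extract_answer(completion) == gold:
            correct += 1
    return correct / len(dataset)
```

Zero-shot variants drop the exemplars entirely, and non-CoT variants use exemplars that contain only final answers rather than worked reasoning.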