Performance was measured on over 150 benchmark datasets spanning a wide range of languages, complemented by extensive human evaluations.
Llama 3.1 benchmark results, compared with the April Llama 3 releases:

| Category | Benchmark | Llama 3.1 8B | Llama 3 8B (April) | Llama 3.1 70B | Llama 3 70B (April) | Llama 3.1 405B |
|---|---|---|---|---|---|---|
| General | MMLU (CoT) | 73.0 |  | 86.0 | 80.9 | 88.6 |
| | MMLU PRO (5-shot, CoT) | 48.3 | 45.5 | 66.4 | 63.4 | 73.3 |
| | IFEval | 80.4 | 76.8 | 87.5 | 82.9 | 88.6 |
| Code | HumanEval (0-shot) | 72.6 | 60.4 | 80.5 | 81.7 | 89.0 |
| | MBPP EvalPlus (base) (0-shot) | 72.8 | 70.6 | 86.0 | 82.5 | 88.6 |
| Math | GSM8K (8-shot, CoT) | 84.5 | 80.6 | 95.1 | 93.0 | 96.8 |
| | MATH (0-shot, CoT) | 51.9 | 29.1 | 68.0 | 51.0 | 73.8 |
| Reasoning | ARC Challenge (0-shot) | 83.4 | 82.4 | 94.8 | 94.4 | 96.9 |
| | GPQA (0-shot, CoT) | 32.8 | 34.6 | 46.7 | 39.5 | 51.1 |
| Tool use | API-Bank (0-shot) | 82.6 | 48.3 | 90.0 | 85.1 | 92.3 |
| | BFCL | 76.1 | 60.3 | 84.8 | 83.0 | 88.5 |
| | Gorilla Benchmark API Bench | 8.2 | 1.7 | 29.7 | 14.7 | 35.3 |
| | Nexus (0-shot) | 38.5 | 18.1 | 56.7 | 47.8 | 58.7 |
| Multilingual | Multilingual MGSM | 68.9 | - | 86.9 | - | 91.6 |
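The parenthetical settings in the table (0-shot, 5-shot, 8-shot, CoT) describe how each benchmark was prompted: how many worked examples precede the test question, and whether the model is asked to reason step by step (chain of thought) before answering. The sketch below is a rough illustration of that idea under a generic harness; it is not Meta's evaluation code, and the prompt template and exemplar formatting are assumptions.

```python
# Rough illustration of "k-shot, CoT" prompting; not Meta's evaluation harness.
# The template and exemplar formatting here are assumptions for illustration only.

def build_eval_prompt(question, exemplars=(), cot=True):
    """Assemble a benchmark prompt.

    exemplars: (question, worked_answer) pairs; their count is the "shot" number
               (an empty sequence means 0-shot).
    cot:       if True, the model is asked to reason step by step before answering.
    """
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]  # few-shot demos
    suffix = "Let's think step by step." if cot else ""
    parts.append(f"Question: {question}\nAnswer: {suffix}".rstrip())
    return "\n\n".join(parts)


# 0-shot, CoT (the MATH setting above): no demonstrations, reasoning requested.
print(build_eval_prompt("What is 12 * 17?"))

# 8-shot, CoT (the GSM8K setting above) passes eight worked examples instead:
demo = ("A pen costs $2 and a pad costs $3. What do 2 pens and 1 pad cost?",
        "Two pens cost 2 * $2 = $4, plus a $3 pad gives $7. The answer is 7.")
print(build_eval_prompt("What is 12 * 17?", exemplars=[demo] * 8))
```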
The Llama 3.2 lightweight models (1B and 3B) were evaluated on a similar suite, compared against similarly sized open models:

| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5-mini IT |
|---|---|---|---|---|---|
| General | MMLU (0-shot) | 49.3 | 63.4 | 57.8 | 69.0 |
| | Open-rewrite eval (0-shot, rougeL) | 41.6 | 40.1 | 31.2 | 34.5 |
| | TLDR9+ (test, 1-shot, rougeL) | 16.8 | 19.0 | 13.9 | 12.8 |
| IFEval | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Math | GSM8K (8-shot, CoT) | 44.4 | 77.7 | 62.5 | 86.2 |
| | MATH (0-shot, CoT) | 30.6 | 48.0 | 23.8 | 44.2 |
| Reasoning | ARC Challenge (0-shot) | 59.4 | 78.6 | 76.7 | 87.4 |
| | GPQA (0-shot) | 27.2 | 32.8 | 27.5 | 31.9 |
| | Hellaswag (0-shot) | 41.2 | 69.8 | 61.1 | 81.4 |
| Tool use | BFCL V2 (0-shot) | 25.7 | 67.0 | 27.4 | 58.4 |
| | Nexus (0-shot) | 13.5 | 34.3 | 21.0 | 26.1 |
| Long context | InfiniteBench/En.MC (128k) | 38.0 | 63.3 | - | 39.2 |
| | InfiniteBench/En.QA (128k) | 20.3 | 19.8 | - | 11.3 |
| | NIH/Multi-needle | 75.0 | 84.7 | - | 52.7 |
| Multilingual | MGSM (0-shot, CoT) | 24.5 | 58.2 | 40.2 | 49.8 |
Llama 3.1 API pricing (USD per 1M tokens) from hosting API providers, one provider per column:

| Model | Tokens | Provider A | Provider B | Provider C | Provider D | Provider E | Provider F | Provider G | Provider H |
|---|---|---|---|---|---|---|---|---|---|
| 8B | Input | $0.22 | $0.30 | - | $0.20 | $0.60 | $0.15 | $0.57 | $0.18 |
| | Output | $0.22 | $0.61 | - | $0.20 | $0.60 | $0.15 | $0.57 | $0.18 |
| 70B | Input | $0.99 | $2.68 | $1.00 | $0.90 | $1.80 | $0.90 | $3.63 | $0.88 |
| | Output | $0.99 | $3.54 | $3.00 | $0.90 | $1.80 | $0.90 | $3.63 | $0.88 |
| 405B | Input | $5.32 | $5.33 | $5.00 | $3.00 | $5.00 | $3.00 | $9.00 | $5.00 |
| | Output | $16.00 | $16.00 | $15.00 | $3.00 | $16.00 | $9.00 | $9.00 | $15.00 |
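As a quick worked example of how per-1M-token rates translate into request costs, the sketch below applies the $0.22 / $0.22 (8B) and $5.32 / $16.00 (405B) input/output rates from the table to a hypothetical request; the helper function and token counts are illustrative, not a provider SDK.

```python
# Illustrative cost arithmetic only. The rates are example input/output prices
# (USD per 1M tokens) taken from the pricing table above.

def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD of one request, given per-1M-token input and output rates."""
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate

# Hypothetical request: 2,000 prompt tokens in, 500 generated tokens out.
cost_8b = request_cost(2_000, 500, input_rate=0.22, output_rate=0.22)     # 8B rates
cost_405b = request_cost(2_000, 500, input_rate=5.32, output_rate=16.00)  # 405B rates

print(f"8B:   ${cost_8b:.6f} per request")    # ~$0.000550
print(f"405B: ${cost_405b:.6f} per request")  # ~$0.018640
```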