We evaluated performance on over 150 benchmark datasets that span a wide range of languages. For the vision LLMs, we evaluated performance on benchmarks for image understanding and visual reasoning. In addition, we performed extensive human evaluations that compare Llama with competing models in real-world scenarios.
| Category | Benchmark | | | Amazon Nova Pro | | Gemini Pro 1.5 | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|---|---|---|
| General | MMLU Chat (0-shot, CoT) | 86.0 | 86.0 | 85.9 | 88.6 | 87.1 | 87.5 | 88.9 |
| General | MMLU PRO (5-shot, CoT) | 66.4 | 68.9 | - | 73.4 | 76.1 | 73.8 | 77.8 |
| Instruction Following | IFEval | 87.5 | 92.1 | 92.1 | 88.6 | 81.9 | 84.6 | 89.3 |
| Code | HumanEval (0-shot) | 80.5 | 88.4 | 89.0 | 89.0 | 89.0 | 86.0 | 93.7 |
| Code | MBPP EvalPlus (base) (0-shot) | 86.0 | 87.6 | - | 88.6 | 87.8 | 83.9 | 86.8 |
| Math | MATH (0-shot, CoT) | 67.8 | 77.0 | 76.6 | 73.9 | 82.9 | 76.9 | 78.3 |
| Reasoning | GPQA Diamond (0-shot, CoT) | 48.0 | 50.5 | - | 49.0 | 53.5 | 47.5 | 65.0 |
| Tool use | BFCL v2 (0-shot) | 77.5 | 77.3 | - | 81.1 | 80.3 | 74.0 | 79.3 |
| Long context | NIH/Multi-needle | 97.5 | 97.5 | - | 98.1 | 94.7 | - | 99.4 |
| Multilingual | Multilingual MGSM (0-shot) | 86.9 | 91.1 | - | 91.6 | 89.6 | 90.6 | 92.8 |
| Pricing* | 1M Input tokens (Cheapest among providers)* | $0.1 | $0.1 | $0.80 | $1.0 | $1.30 | $2.5 | $3.0 |
| Pricing* | 1M Output tokens (Cheapest among providers)* | $0.4 | $0.4 | $3.20 | $1.8 | $5.0 | $10.0 | $15.0 |
* API Pricing based on publicly available data on Artificial Analysis as of 12/3/24.
Learn how partners across the community are putting Llama to use in real life.