We evaluated performance on over 150 benchmark datasets that span a wide range of languages. For the vision LLMs, we evaluated performance on benchmarks for image understanding and visual reasoning. In addition, we performed extensive human evaluations that compare Llama with competing models in real-world scenarios.
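| Category | Benchmark | Unlabeled model 1 | Unlabeled model 2 | Amazon Nova Pro | Unlabeled model 3 | Gemini Pro 1.5 | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|---|---|---|
| General | MMLU Chat (0-shot, CoT) | 86.0 | 86.0 | 85.9 | 88.6 | 87.1 | 87.5 | 88.9 |
| General | MMLU PRO (5-shot, CoT) | 66.4 | 68.9 | - | 73.4 | 76.1 | 73.8 | 77.8 |
| Instruction Following | IFEval | 87.5 | 92.1 | 92.1 | 88.6 | 81.9 | 84.6 | 89.3 |
| Code | HumanEval (0-shot) | 80.5 | 88.4 | 89.0 | 89.0 | 89.0 | 86.0 | 93.7 |
| Code | MBPP EvalPlus (base) (0-shot) | 86.0 | 87.6 | - | 88.6 | 87.8 | 83.9 | 86.8 |
| Math | MATH (0-shot, CoT) | 67.8 | 77.0 | 76.6 | 73.9 | 82.9 | 76.9 | 78.3 |
| Reasoning | GPQA Diamond (0-shot, CoT) | 48.0 | 50.5 | - | 49.0 | 53.5 | 47.5 | 65.0 |
| Tool use | BFCL v2 (0-shot) | 77.5 | 77.3 | - | 81.1 | 80.3 | 74.0 | 79.3 |
| Long context | NIH/Multi-needle | 97.5 | 97.5 | - | 98.1 | 94.7 | - | 99.4 |
| Multilingual | Multilingual MGSM (0-shot) | 86.9 | 91.1 | - | 91.6 | 89.6 | 90.6 | 92.8 |
| Pricing* | 1M Input tokens (Cheapest among providers)* | $0.10 | $0.40 is listed for output; input $0.10 | $0.80 | $1.00 | $1.30 | $2.50 | $3.00 |
| Pricing* | 1M Output tokens (Cheapest among providers)* | $0.40 | $0.40 | $3.20 | $1.80 | $5.00 | $10.00 | $15.00 |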
* API Pricing based on publicly available data on Artificial Analysis as of 12/3/24.
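The per-million-token prices above can be turned into a rough cost estimate for a given workload. The following is a minimal sketch, not an official calculator: the prices are copied from the pricing rows for the four named competing models, while the function name `estimate_cost` and the example workload (2M input tokens, 500K output tokens) are assumptions made for illustration.

```python
# Prices in USD per 1M tokens (input, output), copied from the pricing rows above.
PRICES_PER_MILLION = {
    "Amazon Nova Pro": (0.80, 3.20),
    "Gemini Pro 1.5": (1.30, 5.00),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one workload (hypothetical helper)."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
for model in PRICES_PER_MILLION:
    cost = estimate_cost(model, 2_000_000, 500_000)
    print(f"{model}: ${cost:.2f}")
```

Because output tokens cost several times more than input tokens for every model listed, the ratio of generated to consumed tokens in a workload can matter as much as the headline input price.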
This guide provides information and resources to help you set up Llama, including how to access the model, hosting options, and how-to and integration guides. You will also find supplemental materials to assist you while building with Llama.
Code Shield supports inference-time filtering of insecure code produced by LLMs. It mitigates the risk of insecure code suggestions and supports secure command execution across 7 programming languages, with an average latency of 200 ms.
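To make the idea of inference-time filtering concrete, here is a minimal sketch of the general pattern: scan generated code before it reaches the user and block it if a known insecure construct appears. This is not Code Shield's API or rule set. The names `INSECURE_PATTERNS`, `scan_generated_code`, and `filter_llm_output`, and the three regular-expression rules, are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical rules for illustration only; not Code Shield's actual rule set.
INSECURE_PATTERNS = {
    "weak hash (MD5)": re.compile(r"hashlib\.md5\("),
    "shell injection risk": re.compile(r"subprocess\.\w+\(.*shell\s*=\s*True"),
    "arbitrary code execution": re.compile(r"\beval\("),
}


@dataclass
class ScanResult:
    is_insecure: bool
    issues: list


def scan_generated_code(code: str) -> ScanResult:
    """Check a generated snippet against each rule and collect the matches."""
    issues = [name for name, pattern in INSECURE_PATTERNS.items() if pattern.search(code)]
    return ScanResult(is_insecure=bool(issues), issues=issues)


def filter_llm_output(code: str) -> str:
    """Return the code unchanged if it passes, otherwise a blocking notice."""
    result = scan_generated_code(code)
    if result.is_insecure:
        return "# Blocked: " + ", ".join(result.issues)
    return code


# Example: a snippet that hashes with MD5 is blocked before being shown.
print(filter_llm_output("import hashlib\ndigest = hashlib.md5(b'data').hexdigest()"))
```

A production filter would rely on a much larger, language-aware rule set rather than a handful of regular expressions; the sketch only shows where such a check sits between model output and the user.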