Llama 4: Leading intelligence.
Unrivaled speed and efficiency.
Llama 4 Capabilities

Benchmarks
| Category | Benchmark | Llama 4 Maverick | Gemini 2.0 Flash | DeepSeek v3.1 | GPT-4o |
|---|---|---|---|---|---|
| Inference Cost | $/Mtok (3:1 blended) | $0.19–$0.49⁵ | $0.17 | $0.48 | $4.38 |
| Image Reasoning | MMMU | 73.4 | 71.7 | No multimodal support | 69.1 |
| Image Reasoning | MathVista | 73.7 | 73.1 | No multimodal support | 63.8 |
| Image Understanding | ChartQA | 90.0 | 88.3 | No multimodal support | 85.7 |
| Image Understanding | DocVQA (test) | 94.4 | - | No multimodal support | 92.8 |
| Coding | LiveCodeBench | 43.4 | 34.5 | 45.8/49.2³ | 32.3³ |
| Reasoning & Knowledge | MMLU Pro | 80.5 | 77.6 | 81.2 | - |
| Reasoning & Knowledge | GPQA Diamond | 69.8 | 60.1 | 68.4 | 53.6 |
| Multilingual | Multilingual MMLU | 84.6 | - | - | 81.5 |
| Long context | MTOB (half book), eng→kgv/kgv→eng | 54.0/46.4 | 48.4/39.8⁴ | Context window is 128K | Context window is 128K |
| Long context | MTOB (full book), eng→kgv/kgv→eng | 50.8/46.7 | 45.5/39.6⁴ | Context window is 128K | Context window is 128K |
1. For Llama model results, we report 0-shot evaluation with temperature = 0 and no majority voting or parallel test-time compute. For high-variance benchmarks (GPQA Diamond, LiveCodeBench), we average over multiple generations to reduce uncertainty (see the first sketch below).
2. For non-Llama models, we source the highest available self-reported eval results unless otherwise specified. We only include evals from models whose results are reproducible (via API or open weights), and we only include non-thinking models. Cost estimates for non-Llama models are sourced from Artificial Analysis.
3. The date range behind DeepSeek v3.1's self-reported LiveCodeBench result (49.2) is unknown, so we also provide our internal result (45.8) on the defined date range. Results for GPT-4o are sourced from the LCB leaderboard.
4. Specialized long-context evals are not traditionally reported for generalist models, so we share internal runs to showcase Llama's frontier performance.
5. $0.19/Mtok (3:1 blended) is our cost estimate for Llama 4 Maverick assuming distributed inference. On a single host, we project the model can be served at $0.30–$0.49/Mtok (3:1 blended); the second sketch below shows how a 3:1 blended figure is computed.
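To make note 1 concrete, here is a minimal Python sketch of averaging a benchmark score over several generations. `run_benchmark` is a hypothetical stand-in that simulates run-to-run noise, not Meta's actual eval harness, and the 69.8 placeholder mean is simply the GPQA Diamond figure from the table above.

```python
import random
import statistics

def run_benchmark(seed: int) -> float:
    """Hypothetical stand-in for one full eval pass, returning a 0-100 score.

    A real harness would decode every problem and grade the outputs; here we
    simulate the run-to-run noise that high-variance benchmarks exhibit."""
    rng = random.Random(seed)
    return 69.8 + rng.gauss(0.0, 1.5)  # placeholder mean from the table above

def averaged_score(runs: int = 5) -> float:
    # Averaging several independent passes damps the variance that a single
    # pass of a noisy benchmark (e.g. GPQA Diamond) would show.
    return statistics.mean(run_benchmark(seed) for seed in range(runs))

print(f"Averaged over {5} runs: {averaged_score():.1f}")
```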
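And to make note 5 concrete: a "3:1 blended" $/Mtok price is a weighted average of input and output token prices under an assumed 3:1 input-to-output token mix. The sketch below shows only the arithmetic; the $0.12 and $0.40 per-Mtok prices are hypothetical placeholders chosen so the blend lands at $0.19, not Llama 4 Maverick's actual serving costs.

```python
def blended_cost(input_price: float, output_price: float,
                 input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted-average $/Mtok for a fixed input:output token mix."""
    total = input_weight + output_weight
    return (input_price * input_weight + output_price * output_weight) / total

# Hypothetical prices: (3 * 0.12 + 1 * 0.40) / 4 = 0.19 $/Mtok.
print(f"${blended_cost(0.12, 0.40):.2f}/Mtok (3:1 blended)")
```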
Docs
The guides and resources you need to build with Llama 4.

Cookbooks
Check out our collection of Llama recipes to help you get started faster.

Case studies
See how other innovators are building with Llama.

Our partner ecosystem