| | Llama 4 Maverick | Gemini 2.0 Flash | DeepSeek v3.1 | GPT-4o |
|---|---|---|---|---|
| Inference Cost ($/1M tokens, 3:1 blended) | $0.19-$0.49⁵ | $0.17 | $0.48 | $4.38 |
| Image Reasoning: MMMU | 73.4 | 71.7 | No multimodal support | 69.1 |
| Image Reasoning: MathVista | 73.7 | 73.1 | No multimodal support | 63.8 |
| Image Understanding: ChartQA | 90.0 | 88.3 | No multimodal support | 85.7 |
| Image Understanding: DocVQA | 94.4 | - | No multimodal support | 92.8 |
| Coding: LiveCodeBench | 43.4 | 34.5 | 45.8/49.2³ | 32.3³ |
| Reasoning & Knowledge: MMLU Pro | 80.5 | 77.6 | 81.2 | - |
| Reasoning & Knowledge: GPQA Diamond | 69.8 | 60.1 | 68.4 | 53.6 |
| Multilingual: Multilingual MMLU | 84.6 | - | - | 81.5 |
| Long context: MTOB (half book), eng→kgv/kgv→eng | 54.0/46.4 | 48.4/39.8⁴ | Context window is 128K | Context window is 128K |
| Long context: MTOB (full book), eng→kgv/kgv→eng | 50.8/46.7 | 45.5/39.6⁴ | Context window is 128K | Context window is 128K |
¹ For Llama model results, we report 0-shot evaluation with temperature = 0 and no majority voting or parallel test-time compute. For high-variance benchmarks (GPQA Diamond, LiveCodeBench), we average over multiple generations to reduce uncertainty.
² For non-Llama models, we source the highest available self-reported eval results, unless otherwise specified. We only include results from models whose evals are reproducible (via API or open weights), and we only include non-thinking models. Cost estimates for non-Llama models are sourced from Artificial Analysis.
³ DeepSeek v3.1's self-reported LiveCodeBench result (49.2) uses an unknown date range, so we also provide our internal result (45.8) on the defined date range. Results for GPT-4o are sourced from the LiveCodeBench (LCB) leaderboard.
⁴ Specialized long-context evals are not traditionally reported for generalist models, so we share internal runs to showcase Llama's frontier performance.
⁵ $0.19/Mtok (3:1 blended) is our cost estimate for Llama 4 Maverick assuming distributed inference. On a single host, we project the model can be served at $0.30-$0.49/Mtok (3:1 blended).
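For context on the cost figures above, a 3:1 blended price is a weighted average of input- and output-token prices, with input tokens weighted three to one. The sketch below illustrates that arithmetic; the per-token prices in it are placeholder assumptions chosen for illustration, not published figures.

```python
def blended_cost_per_mtok(input_price: float, output_price: float,
                          input_weight: int = 3, output_weight: int = 1) -> float:
    """Weighted-average price per 1M tokens for a given input:output token mix.

    With the 3:1 weighting used in the table, 3 of every 4 tokens are input tokens.
    """
    total = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total

# Hypothetical per-Mtok prices, used only to illustrate the arithmetic.
print(f"${blended_cost_per_mtok(0.11, 0.43):.2f}/Mtok (3:1 blended)")  # -> $0.19/Mtok
```

A workload with a heavier share of output tokens (for example a 1:1 mix) would shift the blended figure toward the output-token price, which is typically the higher of the two.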