We evaluated performance on over 150 benchmark datasets spanning a wide range of languages. For the vision LLMs, we measured performance on benchmarks for image understanding and visual reasoning. In addition, we ran extensive human evaluations comparing Llama 3.2 with competing models in real-world scenarios.
| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5-mini IT |
|---|---|---|---|---|---|
| General | MMLU | 49.3 | 63.4 | 57.8 | 69.0 |
| General | Open-rewrite eval | 41.6 | 40.1 | 31.2 | 34.5 |
| General | TLDR9+ | 16.8 | 19.0 | 13.9 | 12.8 |
| IFEval | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Math | GSM8K | 44.4 | 77.7 | 62.5 | 86.2 |
| Math | MATH | 30.6 | 48.0 | 23.8 | 44.2 |
| Reasoning | ARC Challenge | 59.4 | 78.6 | 76.7 | 87.4 |
| Reasoning | GPQA | 27.2 | 32.8 | 27.5 | 31.9 |
| Reasoning | Hellaswag | 41.2 | 69.8 | 61.1 | 81.4 |
| Tool use | BFCL V2 | 25.7 | 67.0 | 27.4 | 58.4 |
| Tool use | Nexus | 13.5 | 34.3 | 21.0 | 26.1 |
| Long context | InfiniteBench/En.MC | 38.0 | 63.3 | - | 39.2 |
| Long context | InfiniteBench/En.QA | 20.3 | 19.8 | - | 11.3 |
| Long context | NIH/Multi-needle | 75.0 | 84.7 | - | 52.7 |
| Multilingual | MGSM | 24.5 | 58.2 | 40.2 | 49.8 |
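The scores above are easier to compare at a glance with a single summary number per model. As a minimal, illustrative sketch (this macro average is our own shorthand, not an official metric from the evaluation; note that the Gemma 2 2B IT column lacks long-context results, so its average covers fewer benchmarks and is not directly comparable):

```python
# Benchmark scores transcribed from the comparison table above, one list per
# model, in table row order. None marks results that were not reported.
scores = {
    "Llama 3.2 1B":    [49.3, 41.6, 16.8, 59.5, 44.4, 30.6, 59.4, 27.2, 41.2,
                        25.7, 13.5, 38.0, 20.3, 75.0, 24.5],
    "Llama 3.2 3B":    [63.4, 40.1, 19.0, 77.4, 77.7, 48.0, 78.6, 32.8, 69.8,
                        67.0, 34.3, 63.3, 19.8, 84.7, 58.2],
    "Gemma 2 2B IT":   [57.8, 31.2, 13.9, 61.9, 62.5, 23.8, 76.7, 27.5, 61.1,
                        27.4, 21.0, None, None, None, 40.2],
    "Phi-3.5-mini IT": [69.0, 34.5, 12.8, 59.2, 86.2, 44.2, 87.4, 31.9, 81.4,
                        58.4, 26.1, 39.2, 11.3, 52.7, 49.8],
}

def macro_average(vals):
    """Unweighted mean over the benchmarks that were actually reported."""
    reported = [v for v in vals if v is not None]
    return round(sum(reported) / len(reported), 1)

for model, vals in scores.items():
    print(f"{model}: {macro_average(vals)}")
```

This treats every benchmark equally, which hides category structure (e.g. the 3B model's large lead on tool use); per-category averages would be a finer-grained variant of the same idea.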