We evaluated performance on over 150 benchmark datasets that span a wide range of languages. For the vision LLMs, we evaluated performance on benchmarks for image understanding and visual reasoning. In addition, we performed extensive human evaluations that compared Llama 3.2 with competing models in real-world scenarios. The table below summarizes results for the lightweight 1B and 3B instruction-tuned models.
| Category | Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2 2B IT | Phi-3.5-mini IT |
|---|---|---|---|---|---|
| General | MMLU | 49.3 | 63.4 | 57.8 | 69.0 |
| General | Open-rewrite eval | 41.6 | 40.1 | 31.2 | 34.5 |
| General | TLDR9+ | 16.8 | 19.0 | 13.9 | 12.8 |
| IFEval | IFEval | 59.5 | 77.4 | 61.9 | 59.2 |
| Math | GSM8K | 44.4 | 77.7 | 62.5 | 86.2 |
| Math | MATH | 30.6 | 48.0 | 23.8 | 44.2 |
| Reasoning | ARC Challenge | 59.4 | 78.6 | 76.7 | 87.4 |
| Reasoning | GPQA | 27.2 | 32.8 | 27.5 | 31.9 |
| Reasoning | Hellaswag | 41.2 | 69.8 | 61.1 | 81.4 |
| Tool use | BFCL V2 | 25.7 | 67.0 | 27.4 | 58.4 |
| Tool use | Nexus | 13.5 | 34.3 | 21.0 | 26.1 |
| Long context | InfiniteBench/En.MC | 38.0 | 63.3 | - | 39.2 |
| Long context | InfiniteBench/En.QA | 20.3 | 19.8 | - | 11.3 |
| Long context | NIH/Multi-needle | 75.0 | 84.7 | - | 52.7 |
| Multilingual | MGSM | 24.5 | 58.2 | 40.2 | 49.8 |
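For readers who want to slice these numbers programmatically, here is a minimal Python sketch that loads a subset of the scores from the table above into a plain dictionary and prints the 3B-over-1B point deltas. The dictionary layout and the `deltas` helper are illustrative conventions, not part of any Llama tooling; the values themselves are copied directly from the table.

```python
# Illustrative only: scores are copied from the benchmark table above.
# The data layout and helper below are our own sketch, not Llama tooling.
SCORES = {
    "MMLU":          {"Llama 3.2 1B": 49.3, "Llama 3.2 3B": 63.4},
    "IFEval":        {"Llama 3.2 1B": 59.5, "Llama 3.2 3B": 77.4},
    "GSM8K":         {"Llama 3.2 1B": 44.4, "Llama 3.2 3B": 77.7},
    "ARC Challenge": {"Llama 3.2 1B": 59.4, "Llama 3.2 3B": 78.6},
    "MGSM":          {"Llama 3.2 1B": 24.5, "Llama 3.2 3B": 58.2},
}

def deltas(scores, base="Llama 3.2 1B", other="Llama 3.2 3B"):
    """Return the per-benchmark score difference (other minus base), in points."""
    return {bench: round(vals[other] - vals[base], 1) for bench, vals in scores.items()}

print(deltas(SCORES))
# {'MMLU': 14.1, 'IFEval': 17.9, 'GSM8K': 33.3, 'ARC Challenge': 19.2, 'MGSM': 33.7}
```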