As measured on over 150 benchmark datasets spanning a wide range of languages, and in extensive human evaluations.
General
  MMLU (CoT)
  MMLU PRO (5-shot, CoT)
  IFEval
Code
  HumanEval (0-shot)
  MBPP EvalPlus (base) (0-shot)
Math
  GSM8K (8-shot, CoT)
  MATH (0-shot, CoT)
Reasoning
  ARC Challenge (0-shot)
  GPQA (0-shot, CoT)
Tool use
  API-Bank (0-shot)
  BFCL
  Gorilla Benchmark API Bench
  Nexus (0-shot)
Multilingual
  Multilingual MGSM