As measured on over 150 benchmark datasets spanning a wide range of languages, and in extensive human evaluations.
General
MMLU (CoT)
MMLU PRO (5-shot, CoT)
IFEval
Code
HumanEval (0-shot)
MBPP EvalPlus (base) (0-shot)
Math