OpenAI open sourcing new GPT-4 Turbo evals



OpenAI today announced that it is open-sourcing a GitHub repository to run popular evals on various models including the new GPT-4 Turbo.

The company says the new GPT-4 Turbo improves on its predecessor in writing, math, logical reasoning, and coding. Its responses are more direct, less verbose, and use more conversational language.


OpenAI GPT-4 Turbo (Image Credit: OpenAI)

The repository on GitHub contains a library for evaluating language models. The evals now include:

  • MMLU: Measuring Massive Multitask Language Understanding
  • MATH: Measuring Mathematical Problem Solving With the MATH Dataset
  • GPQA: A Graduate-Level Google-Proof Q&A Benchmark
  • DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
  • MGSM: Multilingual Grade School Math Benchmark, from Language Models are Multilingual Chain-of-Thought Reasoners
  • HumanEval: Evaluating Large Language Models Trained on Code
  • MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Evals are sensitive to prompting, and recent publications and libraries vary in how they formulate their prompts. According to OpenAI, some of these formulations are carryovers from evaluating base models and from models that were worse at following instructions.
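To make the point about prompt sensitivity concrete, here is a minimal Python sketch that formats one multiple-choice question in two ways: a zero-shot instruction prompt and a few-shot prompt with a solved example prepended. This is not code from OpenAI's repository. The question, the function names, and the "Answer: <letter>" convention are all assumptions made up for illustration.

```python
import re

# Hypothetical multiple-choice item, made up for illustration only.
QUESTION = "Which planet is closest to the Sun?"
CHOICES = {"A": "Venus", "B": "Mercury", "C": "Earth", "D": "Mars"}


def format_choices(choices):
    """Render the answer options one per line, e.g. 'A) Venus'."""
    return "\n".join(f"{letter}) {text}" for letter, text in choices.items())


def zero_shot_prompt(question, choices):
    """Instruction-style prompt with no worked examples."""
    return (
        "Answer the following multiple-choice question. "
        "End your reply with 'Answer: <letter>'.\n\n"
        f"{question}\n{format_choices(choices)}"
    )


def few_shot_prompt(question, choices, examples):
    """Prompt that prepends solved examples before the real question."""
    shots = "\n\n".join(
        f"{q}\n{format_choices(c)}\nAnswer: {a}" for q, c, a in examples
    )
    return f"{shots}\n\n{question}\n{format_choices(choices)}\nAnswer:"


def extract_answer(response):
    """Return the last answer letter found in a model reply, or None."""
    matches = re.findall(r"Answer:\s*([A-D])", response)
    return matches[-1] if matches else None
```

The same question produces two different prompts, and a model may score differently on each, which is why results obtained with different formulations are hard to compare directly. The answer-extraction step is another place where small choices (here, taking the last "Answer:" match) can shift a reported score.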
