Researchers launch new benchmark to test capabilities of AI models

Presents hundreds of expert-level, unpublished mathematics problems.

By Abbinaya Kuzhanthaivel on Nov 13, 2024 2:00PM

Epoch AI, a nonprofit research organisation that investigates trends in AI and works to keep its development aligned with ethical principles, has launched a new AI benchmark to test large language models (LLMs) on their reasoning and mathematical problem-solving skills.

Called FrontierMath, this tool features hundreds of expert-level, unpublished mathematics problems that could serve as an ongoing benchmark for tracking the progress of AI in complex mathematical reasoning.

The research group said these range from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory.

"We developed it through collaboration with over 60 mathematicians from leading institutions, including professors, IMO question writers, and Fields medalists," it added.

It added that even the most advanced LLMs have scored under two percent on the new benchmark.

Epoch AI claims that current benchmarks like GSM8K and MATH are inadequate due to data contamination and the tendency of AI models to achieve unnaturally high scores.
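For context, data contamination is typically detected by searching for verbatim overlap between benchmark items and training data. The article does not describe Epoch AI's methodology, so the sketch below illustrates one widely used heuristic instead: flagging any problem that shares a long token n-gram with a corpus document. The function names and the 13-token window are illustrative assumptions, not details from the source.

```python
# Illustrative contamination check (not Epoch AI's actual method):
# flag a benchmark problem if any long run of tokens from its text
# appears verbatim in a training-corpus document.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All length-n token windows in the text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(problem_text: str, corpus_docs: list[str], n: int = 13) -> bool:
    """True if the problem shares any length-n token run with the corpus."""
    probe = ngrams(problem_text, n)
    return any(probe & ngrams(doc, n) for doc in corpus_docs)

# Unpublished problems should, by construction, trip a check like this
# on essentially no training documents.
```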

FrontierMath is said to address these issues by introducing a set of unique, unpublished problems, reducing the risks of data contamination. The problems are designed to be "guess-proof," meaning they can only be solved through strong, logical reasoning, making accidental answers very unlikely.

As the research paper explains, the problems have large numerical answers or complex mathematical objects as solutions, with less than a 1 percent chance of guessing correctly without the proper reasoning.
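Because each answer is a single exact object, grading can be fully automatic: a submission either matches the reference value or it does not. The article does not include Epoch AI's grading code, so the following is a minimal sketch under that assumption; the problem IDs and answer values are made up for illustration.

```python
from fractions import Fraction

# Hypothetical problem set: each reference answer is an exact object
# (a large integer or exact rational), so a model's output can be
# checked mechanically, with no human grading and no partial credit.
PROBLEMS = {
    "hypothetical-problem-001": Fraction(9811, 4),  # illustrative value only
    "hypothetical-problem-002": 2**61 - 1,          # large integer answer
}

def parse_answer(raw: str):
    """Parse a model's final answer into an exact value (int or Fraction)."""
    raw = raw.strip()
    try:
        return int(raw)
    except ValueError:
        return Fraction(raw)  # accepts strings like "9811/4"

def grade(submissions: dict[str, str]) -> float:
    """Return the fraction of problems answered exactly correctly."""
    correct = 0
    for pid, reference in PROBLEMS.items():
        try:
            correct += parse_answer(submissions.get(pid, "")) == reference
        except (ValueError, ZeroDivisionError):
            pass  # unparseable output simply counts as wrong
    return correct / len(PROBLEMS)

# Example: one exact match out of two problems yields a score of 0.5.
print(grade({"hypothetical-problem-001": "9811/4",
             "hypothetical-problem-002": "42"}))
```

Exact matching over an enormous answer space is what makes the problems guess-proof: a randomly produced large integer has a vanishing chance of equalling the reference value.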

Epoch AI asserts that to truly gauge AI's capabilities, benchmarks should focus on creative problem-solving that requires sustained reasoning over multiple steps. Many experts in the field agree that current benchmarks fall short in accurately assessing the depth of an AI model's capabilities.

The group aims to collaborate further with the mathematics and AI research communities to refine and expand the benchmark, ensuring it remains relevant and challenging for future AI systems. It plans to conduct regular evaluations to provide a standardised measure of progress and to track how reasoning abilities improve over time and with scale.

© iTnews Asia