
Researchers launch new benchmark to test capabilities of AI models

Presents hundreds of expert-level, unpublished mathematics problems.

By Abbinaya Kuzhanthaivel on Nov 13, 2024 2:00PM

Epoch AI, a nonprofit research organisation that studies AI trends and works to align AI development with ethical principles, has launched a new benchmark that tests large language models (LLMs) on their reasoning and mathematical problem-solving skills.

Called FrontierMath, the benchmark features hundreds of expert-level, unpublished mathematics problems and is intended to serve as an ongoing measure of AI progress in complex mathematical reasoning.

The research group said these range from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory.

"We developed it through collaboration with over 60 mathematicians from leading institutions, including professors, IMO question writers, and Fields medalists," it added.

It added that even the most advanced LLMs have scored under two percent on the new benchmark.

Epoch AI claims that current benchmarks like GSM8K and MATH are inadequate due to data contamination and the tendency of AI models to achieve unnaturally high scores.

FrontierMath is said to address these issues by introducing a set of unique, unpublished problems, reducing the risks of data contamination. The problems are designed to be "guess-proof," meaning they can only be solved through strong, logical reasoning, making accidental answers very unlikely.

As the research paper explains, the problems have large numerical answers or complex mathematical objects as solutions, with less than a 1 percent chance of guessing correctly without the proper reasoning.
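The "guess-proof" design described above can be illustrated with a minimal autograding sketch: when the reference answer is a single large exact value, only exact-match submissions score, so random or pattern-matched guesses almost never succeed. The problem value, grader, and answers below are hypothetical illustrations, not actual FrontierMath material:

```python
from fractions import Fraction

def grade(submitted: str, reference: Fraction) -> bool:
    """Exact-match check: the submission must equal the reference value exactly."""
    try:
        return Fraction(submitted) == reference
    except (ValueError, ZeroDivisionError):
        # Malformed or undefined submissions score zero rather than erroring.
        return False

# Hypothetical reference answer with many digits, mirroring the article's point
# that problems have large numerical answers resistant to guessing.
reference = Fraction(1781571817182851)

print(grade("1781571817182851", reference))  # exact value required
print(grade("1781571817182852", reference))  # an off-by-one guess fails
```

Because the answer space is effectively unbounded, a model that has not done the underlying reasoning has a vanishingly small chance of producing the exact value, which is the property the paper quantifies as under one percent.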

Epoch AI asserts that to truly gauge AI's capabilities, benchmarks should focus on creative problem-solving that requires sustained reasoning over multiple steps. Many experts in the field agree that current benchmarks fall short in accurately assessing the depth of an AI model's capabilities.

The group aims to collaborate further with the mathematics and AI research communities to refine and expand the benchmark, ensuring it remains relevant and challenging for future AI systems. It plans to conduct regular evaluations to provide a standardised measure of progress and to track how reasoning abilities improve over time and with scale.

© iTnews Asia

All rights reserved. This material may not be published, broadcast, rewritten or redistributed in any form without prior authorisation.