New Benchmarks Emerge for Evaluating LLM Capabilities
LemmaBench and DARE-bench are two new benchmarks for evaluating large language models (LLMs) in mathematics and data science, respectively. LemmaBench focuses on research-level mathematics, while DARE-bench targets complex data science tasks. Both benchmarks expose significant performance gaps in current LLMs, underscoring the importance of rigorous, domain-specific evaluation.
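
To make the evaluation setup concrete, below is a minimal sketch of a generic benchmark-scoring loop of the kind such evaluations typically use. It is not the actual harness for either LemmaBench or DARE-bench; the task file name, record schema, and `query_model` stub are all hypothetical placeholders.

```python
import json

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call (API client, local model, etc.)."""
    raise NotImplementedError("Wire in your model client here.")

def evaluate(task_path: str) -> float:
    """Score exact-match accuracy over a JSONL file of
    {"prompt": ..., "answer": ...} records (an assumed schema)."""
    correct = total = 0
    with open(task_path) as f:
        for line in f:
            task = json.loads(line)
            prediction = query_model(task["prompt"]).strip()
            correct += prediction == task["answer"].strip()
            total += 1
    return correct / total if total else 0.0

# Hypothetical usage; "math_tasks.jsonl" is an illustrative filename,
# not an artifact shipped by either benchmark.
# accuracy = evaluate("math_tasks.jsonl")
```

Real harnesses for these domains are usually more involved, checking mathematical equivalence or executing generated data-science code rather than comparing strings, but the loop structure is the same.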