New Benchmarks Emerge for Evaluating LLM Capabilities
LemmaBench and DARE-bench are two new benchmarks for evaluating large language models (LLMs) in mathematics and data science, respectively. LemmaBench focuses on research-level mathematics, while DARE-bench targets complex data science tasks. Both benchmarks expose significant performance gaps in current LLMs, underscoring the importance of rigorous, domain-specific evaluation.
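
To make the evaluation setup concrete, below is a minimal sketch of a generic benchmark-scoring loop of the kind such evaluations typically use. It is not the actual harness for either LemmaBench or DARE-bench; the task file name, record schema, and `query_model` stub are all hypothetical placeholders.

```python
import json

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call (API client, local model, etc.)."""
    raise NotImplementedError("Wire in your model client here.")

def evaluate(task_path: str) -> float:
    """Score exact-match accuracy over a JSONL file of
    {"prompt": ..., "answer": ...} records (an assumed schema)."""
    correct = total = 0
    with open(task_path) as f:
        for line in f:
            task = json.loads(line)
            prediction = query_model(task["prompt"]).strip()
            correct += prediction == task["answer"].strip()
            total += 1
    return correct / total if total else 0.0

# Hypothetical usage; "math_tasks.jsonl" is an illustrative filename,
# not an artifact shipped by either benchmark.
# accuracy = evaluate("math_tasks.jsonl")
```

Real harnesses for these domains are usually more involved, checking mathematical equivalence or executing generated data-science code rather than comparing strings, but the loop structure is the same.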