Mathbench

Towards Evaluating Natural Language Proofs

The Project

The MathBench project aims to advance the evaluation of AI models in mathematical reasoning by introducing a novel dataset of annotated proofs. This dataset includes theorems paired with multiple distinct proofs, each broken down into key steps and annotated with summaries, erroneous arguments, and logical structure metadata. By leveraging MathBench, the project seeks to assess the ability of AI models to identify critical proof steps, correct errors, and differentiate between valid and invalid proofs, thereby learning about large language models' abilities in mathematical reasoning and understanding.

The Team

Siddharth Bhat is a PhD student at Cambridge. He is an expert in Formal Verification, the implementation of functional programming languages and applied category theory. He is one of the top 20 contributors to the lean theorem prover. He also has deep experience in HPC computing and loop optimisation. His long-term vision is to build scalable systems for Mathematical Theorem proving, and he has published a variety of research towards this goal.

Alex is a Mathematics and ML PhD student at the Georgia Institute of Technology. His research studies the theory and practice of generative modelling with emphasis on improving reasoning via RL and synthetic data. He has interned at Meta, Microsoft, and Google and co-founded the open-source RLHF research group CarperAI.

Stella Biderman is the Executive Director of EleutherAI, a grassroots research collective turned non-profit research institute that started the modern open-source AI movement. Stella's work on models such as GPT-Neo, Pythia, BLOOM, OpenFold, and VQGAN-CLIP has massively increased the accessibility of foundation models while setting the high water mark for transparency and open science in AI. In addition to developing new models, Stella has also spent substantial time working on developing open-source AI libraries, building and documenting public datasets, and doing interpretability research.

Pawan Sasanka Ammanamanchi is a Researcher at Ritual and a volunteer researcher at EleutherAI. He holds a Master's degree from IIIT Hyderabad and specialises in evaluating large language models. His contributions include work on the BLOOM model and the GEM benchmark. He is particularly passionate about advancing AI applications in mathematics.