AI for Science Datasets: Results


In late 2025, Renaissance Philanthropy launched the AI for Science Datasets Request for Proposals in collaboration with the UK Department for Science, Innovation and Technology (DSIT). Our goal was to identify high-value, strategically important dataset concepts that could underpin the next generation of AI-enabled scientific discovery.

This initiative was anchored to DSIT’s AI for Science Strategy, and driven by a shared conviction: the primary constraint on AI-driven scientific breakthroughs is not model capability, but the structure, availability, and governance of scientific data.

The competition was designed to surface and de-risk ideas for transformative new scientific datasets and to position the strongest concepts for larger follow-on funding from government, philanthropy, and other partners.

Context

The history of AI for Science is built on a foundation of transformative datasets. The Protein Data Bank, an openly accessible repository, was the public dataset that enabled the Nobel Prize-winning AlphaFold and RFdiffusion models. More recent efforts, such as the OpenBind consortium, promise to generate large, high-quality datasets on protein–small molecule interactions, accelerating the discovery of life-saving therapies.

This initiative was launched following the publication of DSIT’s AI for Science Strategy, driven by the shared vision that high-quality, targeted data is critical for realising the transformative potential of AI in scientific discovery.

The Process

By the Numbers

79 proposals submitted across five priority domains

25 semifinalists selected and awarded £500 honoraria

10 winners selected and awarded £5,000 honoraria

117 participants across five proposal refinement workshops

Proposals were received from leading universities, national labs, biotech companies, startups, and international consortia, spanning engineering biology, fusion energy, materials science, medical research, and quantum technologies.


Priority Domains

Proposals were invited across five priority areas aligned with DSIT’s AI for Science Strategy:

  • Engineering Biology 

  • Medical Research 

  • Materials Science 

  • Fusion Energy 

  • Quantum Technologies 


How It Worked

The competition operated in two phases.

Phase 1: Open Call. Researchers, labs, companies, consortia, and independent teams worldwide were invited to submit 4-page proposals describing what dataset should exist and why. Each proposal required at least one collaboration with a UK-based organisation or individual. Submissions were reviewed by 12 domain-specific technical experts across the five priority areas. Twenty-five proposals were selected as semifinalists.

Phase 2: Refinement and Selection. Semifinalists participated in structured workshops to deepen their proposals by stress-testing collaboration plans, data infrastructure, governance, and long-term stewardship. A Blue Ribbon Committee of six leaders in AI, data infrastructure, and science policy then assessed the refined proposals and selected the final ten.


Blue Ribbon Committee

The final selection was made by an independent Blue Ribbon Committee whose members are affiliated with the following organisations: DSIT, UK Biobank, Google DeepMind, Convergent Research, Renaissance Philanthropy, and the Wellcome Trust.

The Winners

These 10 proposals represent the strongest dataset concepts to emerge from the competition. Each addresses a critical data bottleneck in its field, with the potential to unlock new AI capabilities for scientific discovery. All winners are receiving ongoing support from Renaissance Philanthropy to develop partnerships, refine budgets, and prepare for funding calls from DSIT and other funders.

Engineering Biology

PePcube

Lead: Rebecca Birolo

PePcube aims to generate the largest systematically measured dataset of peptide developability properties to unlock the design of therapeutic peptides with co-optimised target affinity and other properties necessary for their administration and manufacturability.

Unlocking Nature’s Regulatory Code (Darwin Tree of Life)

Lead: Mark Blaxter

The Darwin Tree of Life project will create the world's richest AI-ready genomic dataset, spanning 20,000 eukaryotic species' genomes plus rich functional datasets, to power a new generation of genomic large language models capable of predicting how genome sequence controls all of life and enabling the next wave of innovation in engineering biology.

The Binding Affinity Bottleneck in Protein Design

Lead: Jakub Lála

By measuring protein binding at unprecedented scale, we aim to build the open dataset powering the next generation of post-AlphaFold AI, moving from predicting protein structure to predicting function for drug discovery.

Preparing the World’s Biological Data for the Age of AI (Covalent.Bio)

Lead: Nick Schaum

Covalent is transforming the world’s fragmented biological data into a single, continuously updated semantic layer ready for AI-driven discovery.

Medical Research

Beyond Opportunistic Datasets: Systematic Generation of Training Data for Immunogenicity Algorithms

Lead: Christopher Thorpe

Being able to predict immunogenicity is critical for the design and safety of next-generation and personalised immune therapies for autoimmunity, infection and cancer. Current approaches are limited by data quality and diversity; we propose collecting higher-quality, more diverse data through active learning prioritisation and closed-loop lab automation.

BioImageNet-UK

Lead: Janos Kriston-Vizi

BioImageNet-UK bridges the critical 'Validation Gap' in TechBio by providing a 50 TB ground-truth architecture to rigorously audit and de-risk AI foundation models in generalist bioimaging.

An Adversarial Benchmark for AI Drug Discovery

Lead: Antonio Ruiz-Gonzalez

Ignota Labs is using AI to map the hidden chemical boundaries where a drug's activity suddenly shifts, creating an open-source database to improve clinical translation rates of new therapies for patients who need them.

An FRO for Receptor Abundance, Specificity, and Internalisation

Leads: Louis “Bobby” Hollingsworth & Becca Carlson

In five years, the Deliverome Project will build public AI-ready datasets for directing medicines to specific tissues, cell types, and diseases, increasing the number of validated targets for precision delivery tenfold.

Advanced Materials

Open Defects Atlas

Lead: Seán Kavanagh

The Open Defects Atlas will create the first large-scale, purpose-built dataset of atomic defects to enable AI-accelerated materials discovery and design across semiconductors, quantum technologies, nuclear fusion and beyond, positioning the UK as a global leader in AI-driven materials innovation.

Nuclear Fusion

An Open, AI-Ready Framework and Reference Dataset for Fusion Time-Series Data

Lead: Federico Felici

This project will enable the fusion energy community to leverage AI and large-scale physics validation through an open framework for fusion time-series data curation.

What’s Next

The 10 winning proposals are now being supported in building coalitions, refining budgets, developing execution plans, and applying for funding calls from DSIT and other potential funders.

For questions about this initiative, please contact datasets@renphil.org