AI for Science Datasets: Results
In late 2025, Renaissance Philanthropy launched the AI for Science Datasets Request for Proposals in collaboration with the UK Department for Science, Innovation and Technology (DSIT). Our goal was to identify high-value, strategically important dataset concepts that could underpin the next generation of AI-enabled scientific discovery.
This initiative was anchored to DSIT’s AI for Science Strategy, and driven by a shared conviction: the primary constraint on AI-driven scientific breakthroughs is not model capability, but the structure, availability, and governance of scientific data.
The competition was designed to surface and de-risk ideas for transformative new scientific datasets and to position the strongest concepts for larger follow-on funding from government, philanthropy, and other partners.
Context
The history of AI for Science is built on a foundation of transformative datasets. The Protein Data Bank, an openly accessible repository, was the public dataset that enabled the Nobel Prize-winning AlphaFold and RFdiffusion models. More recent efforts, such as the OpenBind consortium, promise to generate large, high-quality datasets on protein–small molecule interactions, accelerating the discovery of life-saving therapies.
This initiative was launched following the publication of DSIT’s AI for Science Strategy, driven by the shared vision that high-quality, targeted data is critical for realising the transformative potential of AI in scientific discovery.
The Process
By the Numbers
79 proposals submitted across five priority domains
25 semifinalists selected and awarded £500 honoraria
10 winners selected and awarded £5,000 honoraria
117 participants across five proposal refinement workshops
Proposals were received from leading universities, national labs, biotech companies, startups, and international consortia, spanning engineering biology, fusion energy, materials science, medical research, and quantum technologies.
Priority Domains
Proposals were invited across five priority areas aligned with DSIT’s AI for Science Strategy:
Engineering Biology
Medical Research
Materials Science
Fusion Energy
Quantum Technologies
How It Worked
The competition operated in two phases.
Phase 1: Open Call. Researchers, labs, companies, consortia, and independent teams worldwide were invited to submit 4-page proposals describing what dataset should exist and why. Each proposal required at least one collaboration with a UK-based organisation or individual. Submissions were reviewed by 12 domain-specific technical experts across the five priority areas. Twenty-five proposals were selected as semifinalists.
Phase 2: Refinement and Selection. Semifinalists participated in structured workshops to deepen their proposals by stress-testing collaboration plans, data infrastructure, governance, and long-term stewardship. A Blue Ribbon Committee of six leaders in AI, data infrastructure, and science policy then assessed the refined proposals and selected the final ten.
Blue Ribbon Committee
The final selection was made by an independent Blue Ribbon Committee whose members are affiliated with the following organisations: DSIT, UK Biobank, Google DeepMind, Convergent Research, Renaissance Philanthropy, and the Wellcome Trust.
The Winners
These 10 proposals represent the strongest dataset concepts to emerge from the competition. Each addresses a critical data bottleneck in its field, with the potential to unlock new AI capabilities for scientific discovery. All winners are receiving ongoing support from Renaissance Philanthropy to develop partnerships, refine budgets, and prepare for funding calls from DSIT and other funders.
Engineering Biology
PePcube
Lead: Rebecca Birolo
PePcube aims to generate the largest systematically measured dataset of peptide developability properties, unlocking the design of therapeutic peptides that co-optimise target affinity with the other properties needed for administration and manufacture.
Unlocking Nature’s Regulatory Code (Darwin Tree of Life)
Lead: Mark Blaxter
The Darwin Tree of Life project will create the world's richest AI-ready genomic dataset, spanning 20,000 eukaryotic species' genomes plus rich functional datasets, to power a new generation of genomic large language models capable of predicting how genome sequence controls all of life and enabling the next wave of innovation in engineering biology.
The Binding Affinity Bottleneck in Protein Design
Lead: Jakub Lála
By measuring protein binding at unprecedented scale, we aim to build the open dataset powering the next generation of post-AlphaFold AI, moving from predicting protein structure to predicting function for drug discovery.
Preparing the World’s Biological Data for the Age of AI (Covalent.Bio)
Lead: Nick Schaum
Covalent is transforming the world’s fragmented biological data into a single, continuously updated semantic layer ready for AI-driven discovery.
Medical Research
Beyond Opportunistic Datasets: Systematic Generation of Training Data for Immunogenicity Algorithms
Lead: Christopher Thorpe
Being able to predict immunogenicity is critical for the design and safety of next-generation and personalised immune therapies for autoimmunity, infection and cancer. Current approaches are limited by data quality and diversity; we propose collecting higher-quality, more diverse data through active learning prioritisation and closed-loop lab automation.
BioImageNet-UK
Lead: Janos Kriston-Vizi
BioImageNet-UK bridges the critical 'Validation Gap' in TechBio by providing a 50 TB ground truth architecture to rigorously audit and de-risk AI foundation models in generalist bioimaging.
An Adversarial Benchmark for AI Drug Discovery
Lead: Antonio Ruiz-Gonzalez
Ignota Labs is using AI to map the hidden chemical boundaries where a drug's activity suddenly shifts, creating an open-source database to improve clinical translation rates of new therapies for patients who need them.
An FRO for Receptor Abundance, Specificity, and Internalisation
Leads: Louis “Bobby” Hollingsworth & Becca Carlson
In five years, the Deliverome Project will build public AI-ready datasets for directing medicines to specific tissues, cell types, and diseases, increasing the number of validated targets for precision delivery tenfold.
Materials Science
Open Defects Atlas
Lead: Seán Kavanagh
The Open Defects Atlas will create the first large-scale, purpose-built dataset of atomic defects to enable AI-accelerated materials discovery and design across semiconductors, quantum technologies, nuclear fusion and beyond, positioning the UK as a global leader in AI-driven materials innovation.
Fusion Energy
An Open, AI-Ready Framework and Reference Dataset for Fusion Time-Series Data
Lead: Federico Felici
This project aims to enable the fusion energy community to leverage AI and large-scale physics validation through an open framework for fusion time-series data curation.
What’s Next
The 10 winning proposals are now being supported in building coalitions, refining budgets, developing execution plans, and applying to funding calls from DSIT and other potential funders.
For questions about this initiative, please contact datasets@renphil.org