Open data key to cracking the protein structure prediction problem
Proteins are the building blocks for all living things, providing structure and managing processes in cells. Understanding how these molecules fold into specific 3D shapes is key to understanding their function but requires expensive equipment and lots of time, limiting the progress of research and development.
A new artificial intelligence programme called AlphaFold has been shown to accurately predict protein structure in minutes, solving a decades-old challenge. Its success is built on the availability of thousands of experimentally determined protein structures, a result of long-term research funding, infrastructure investment and data-sharing policies.
DeepMind, the developers of AlphaFold, have made the AlphaFold code and protein structure predictions openly available to the global scientific community. This could mark a major change for both fundamental research and a range of applications – developing new drugs, designing crops resistant to climate change or advancing bio-based technologies.
- A 50-year-old grand scientific challenge has been solved, harnessing open data and long-term investment in digital research infrastructure.
- Thousands of new predicted structures are now openly available to the research and development community through a collaboration between EMBL-EBI and DeepMind, unlocking new research opportunities.
- The software has already been used in the search for enzymes that could recycle single-use plastics. Future applications could include understanding disease, designing climate change resilient crops, accelerating drug development, and tackling antimicrobial resistance.
Understanding the variety of life
Proteins are the key molecular building blocks of life. They fold into complex 3D shapes that determine their function, so knowing the precise 3D structure of a protein is important for many applications in biotechnology and the life sciences, such as the development of new pharmaceuticals.
Proteins are formed from smaller components called amino acids that link together in chains, like beads on a string. There are 20 different genetically encoded amino acids that can combine in different ways to form hundreds of millions of unique proteins in nature, each of which adopts a unique 3D structure.
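To give a feel for the combinatorics described above, the short Python sketch below counts how many distinct chains can be written using a 20-letter amino acid alphabet. The function name and the example chain lengths are illustrative choices, not figures taken from the research described here.

```python
# The 20 genetically encoded amino acids mentioned in the text.
AMINO_ACID_ALPHABET_SIZE = 20


def possible_sequences(chain_length: int) -> int:
    """Number of distinct amino acid chains of the given length.

    Each position in the chain can hold any of the 20 amino acids
    independently, so the count is 20 ** chain_length.
    """
    if chain_length < 0:
        raise ValueError("chain_length must be non-negative")
    return AMINO_ACID_ALPHABET_SIZE ** chain_length


if __name__ == "__main__":
    for length in (2, 10, 100):  # illustrative lengths only
        count = possible_sequences(length)
        # Scientific notation keeps very large counts readable.
        print(f"chain length {length}: {count:.2e} possible sequences")
```

The count grows so quickly with chain length that the proteins found in nature can only ever be a small sample of the sequences that are possible in principle.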
Protein data for all
Determining a single 3D protein structure experimentally is difficult and time-consuming, often requiring months or years of research effort in the laboratory. Because solving even one protein structure is so demanding, there is great benefit in sharing experimentally determined structures, rather than researchers and innovators having to replicate work independently.
These benefits of an open and efficient approach have been realised, initially through journals requiring structural co-ordinates to be available for assessment by any structural biologist, and more recently via the Biotechnology and Biological Sciences Research Council (BBSRC) data-sharing policy (ref 2). To support these mandates, UK Research and Innovation (UKRI), through BBSRC and the Medical Research Council (MRC), along with the Wellcome Trust, has supported world-leading infrastructures for biological information. Resources developed include:
- the European Molecular Biology Laboratory-European Bioinformatics Institute's (EMBL-EBI) Protein Data Bank in Europe (PDBe) (ref 3), a repository for experimentally determined protein structures
- UCL’s CATH (ref 4), which categorises those protein structures, using common folding sections called domains, to identify evolutionary relationships.
There are currently more than 180,000 experimentally determined protein structures available worldwide through the PDB (ref 5). These still represent only a small fraction of the proteins discovered through genome sequencing to date (ref 6). Only 35% of human proteins map to a PDB entry and, in many cases, the structure covers just a fragment of the sequence (ref 7). However, these structures have proven key to enabling the new breakthrough from DeepMind’s deep learning approach.
From the lab to digital
Can we use computers to predict the 3D structure of a protein from its amino acid sequence? This ‘protein folding problem’ is a 50-year-old grand scientific challenge; the difficulty lies in the astronomical number of ways a protein could potentially fold. In 1994, Professor John Moult and Professor Krzysztof Fidelis established the Critical Assessment of protein Structure Prediction (CASP) (ref 8), a global biennial community experiment to test the latest approaches to this intractable problem.
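The scale of that search space can be illustrated with a back-of-the-envelope calculation in the spirit of Levinthal's paradox (ref 14). The Python sketch below is a minimal illustration only, and is not the method used by AlphaFold or by CASP. The 300 rotatable angles, the 10 distinguishable settings per angle and the sampling rate of 10^13 conformations per second are all assumed round numbers chosen for the example.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_conformations(n_angles: int, settings_per_angle: int) -> float:
    """Base-10 logarithm of settings_per_angle ** n_angles.

    Working with logarithms avoids handling astronomically large numbers.
    """
    return n_angles * math.log10(settings_per_angle)


def log10_years_to_enumerate(log10_count: float, samples_per_second: float) -> float:
    """Base-10 logarithm of the years needed to visit every conformation once."""
    return (
        log10_count
        - math.log10(samples_per_second)
        - math.log10(SECONDS_PER_YEAR)
    )


if __name__ == "__main__":
    # All three values below are assumptions made for illustration.
    n_angles = 300             # rotatable angles in a hypothetical protein
    settings_per_angle = 10    # distinguishable settings for each angle
    samples_per_second = 1e13  # conformations tried per second

    log_count = log10_conformations(n_angles, settings_per_angle)
    log_years = log10_years_to_enumerate(log_count, samples_per_second)
    print(f"possible conformations: about 10^{log_count:.0f}")
    print(f"time to try them all: about 10^{log_years:.0f} years")
```

Under these assumptions, trying every conformation in turn would take vastly longer than the age of the universe, which is why exhaustive enumeration has never been a practical route to structure prediction.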
In December 2020, DeepMind’s AlphaFold 2 was recognised as the best performing method in the 14th round of the CASP experiment. For the first time, a prediction method provided an accuracy that was consistent with experimentally determined 3D protein structures. The breakthrough was worldwide news and featured by outlets including the BBC, the Times (ref 9), The Guardian (ref 10), the Daily Mail (ref 11), USA Today (ref 12) and Scientific American (ref 13).
“Our goal at DeepMind has always been to build AI and then use it as a tool to help accelerate the pace of scientific discovery itself, thereby advancing our understanding of the world around us. We used AlphaFold to generate the most complete and accurate picture of the human proteome. We believe this represents the most significant contribution AI has made to advancing scientific knowledge to date, and is a great illustration of the sorts of benefits AI can bring to society.” Demis Hassabis, PhD, Founder and CEO, DeepMind.
Accelerating future R&D
DeepMind are making the many millions of AlphaFold structure predictions and the code freely available to the global research and innovation community, launching the AlphaFold Protein Structure Database (AlphaFold DB) together with EMBL-EBI. John Jumper, PhD, AlphaFold Lead, DeepMind, says: “As the database expands, models will be available for almost every catalogued protein. AlphaFold DB is likely to transform how we approach bioinformatics, the large-scale study of DNA and proteins, as it will enable us to study the proteins of all known organisms with near-atomic precision. We are optimistic that the promise and machine learning advances of AlphaFold will spur the development of an exciting new phase of protein research, where deep learning tools enable quantitative understanding of biology hand-in-hand with experimental methods.”
|Figure|What it represents|
|---|---|
|182,000|Number of structures in the PDB (as of 15/09/21)|
|365,000|Number of predicted structures in AlphaFold DB (as of 15/09/21)|
|10^300|Estimated possible conformations for a typical protein (ref 14)|
|50|Number of years to solve the protein-folding problem|
|92.4|Winning CASP14 score (median across all targets)|
The expansion in coverage (over 100 million structures) will pose further challenges for archiving predicted and experimental data, but will have huge implications for structural biology and for biology in general. It offers not just a massive saving in time but also a change in the way research can be conducted, allowing scientists to design and conduct experiments with greater confidence.
“AlphaFold models can be used to help determine structures through experimental methods. Having a sufficiently accurate initial prediction of the structure will allow researchers to revisit and solve old X-ray datasets and cryo-EM maps for which model building wasn’t previously possible. This is a great example of how computational methods are complementary to experimental approaches.”
Kathryn Tunyasuvunakool, PhD, Science Engineer, DeepMind
Tackling key challenges
Dr Vaishali Waman, CATH researcher, UCL, is currently funded on a UKRI BBSRC COVID-19 project to examine the impacts of SARS-CoV-2 variants of concern. On her screen is the SARS-CoV-2 Spike protein in complex with human ACE2. Vaishali has been examining whether mutations in the virus, particularly in variants of concern, are likely to affect infectivity.
Proteins are involved in every biological process and their 3D shape determines what their role is, or in the case of missing or misshapen proteins, what the disease outcome will be. Their role in health is hard to overstate; for example, our metabolism is regulated by a protein, insulin, while cancer is strongly linked to malfunctioning proteins that cause tumour growth.
“CATH has been used to analyse the effects of mutations in patients with lung cancer. The AlphaFold data will allow us to treble the numbers of protein structures for these human proteins and obtain a much more complete and accurate picture of the functional impacts caused by the mutations, which could be influencing disease progression and drug resistance.”
Professor Christine Orengo, Professor of Bioinformatics UCL
Some proteins function as enzymes, molecules that speed up processes by acting as catalysts. Enzymes are probably best known for their role in biological washing detergents, but researchers are also exploring how they can be harnessed to tackle plastic waste.
“AlphaFold provides us with an exciting new library of templates to engineer faster, more stable and cheaper enzymes for plastic recycling. This doesn’t replace the experimental techniques, in fact given the huge number of new targets that we can now investigate, we are going to need to employ experimental facilities such as the Diamond Light Source even more. In fact, I believe the next biggest scientific discoveries will emerge from the synergistic combination of experimental and AI technologies. Their open access model is now going to open this technology to laboratories around the world, accelerating collaborative efforts to develop enzyme-based solutions for recycling and upcycling plastic bottles and polyester textiles.”
Professor John McGeehan, Centre for Enzyme Innovation, University of Portsmouth
“This will be one of the most important datasets since the mapping of the Human Genome. Making AlphaFold predictions accessible to the international scientific community opens up so many new research avenues, from neglected diseases to new enzymes for biotechnology and everything in between. This is a great new scientific tool, which complements existing technologies, and will allow us to push the boundaries of our understanding of the world.”
Ewan Birney, EMBL-EBI Director
Header image: Structure of a human protein (TRPC5) used to transport calcium across membranes in the nervous system in complex with an inhibitor (Pico145) (ref 1)
- Human TRPC5 structures reveal interaction of a xanthine-based TRPC1/4/5 inhibitor with a conserved lipid binding site | Communications Biology (nature.com) http://doi.org/doi:10.1038/s42003-020-01437-8
- BBSRC (2007) “BBSRC Data Sharing Policy: version 1.22 (March 2017 update)” URL: BBSRC data sharing policy
- PDBe home < EMBL-EBI
- CATH: Protein Structure Classification Database at UCL (cathdb.info)
- Great expectations – the potential impacts of AlphaFold DB | EMBL
- The UniProt Consortium, UniProt: the universal protein knowledgebase in 2021, Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D480–D489, https://doi.org/10.1093/nar/gkaa1100
- Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature (2021). https://doi.org/10.1038/s41586-021-03828-1
- A large‐scale experiment to assess protein structure prediction methods - Moult - 1995 - Proteins: Structure, Function, and Bioinformatics - Wiley Online Library
- Artificial intelligence can predict the structure of almost every protein made by body, study finds | Daily Mail Online
- Levinthal's Paradox (archive.org)