Proteins quickly form complex structures that have proven difficult to predict.
Today DeepMind announced that it appears to have solved one of biology's preeminent problems: how the amino acid chain of a protein folds into a three-dimensional shape that enables its complex functions. It is a computational challenge that has defied the efforts of many very intelligent biologists for decades, even though supercomputer-level hardware has been thrown at these calculations. DeepMind, in contrast, trained its system on 128 specialized processors for a few weeks; it now returns potential structures within a few days.
The system's limitations are not yet clear: DeepMind says it is currently planning a peer-reviewed paper and has so far provided only a blog post and a few press releases. But the system clearly outperforms everything that came before it, having more than doubled the performance of the best system in just four years. While it won't be useful in all circumstances, the advance likely means that the structure of many proteins can now be predicted from nothing more than the DNA sequence of the gene that encodes them, which would be a major change for biology.
Between the folds
To make proteins, our cells (and those of every other organism) chemically link amino acids into a chain. This works because each amino acid shares a backbone that can be chemically linked to form a polymer. But each of the 20 amino acids used by life has a specific set of atoms attached to that backbone. These can be charged or neutral, acidic or basic, etc., and these properties determine how each amino acid interacts with its neighbors and the environment.
The interactions among these amino acids determine the three-dimensional structure that the chain adopts after it is made. Hydrophobic amino acids end up in the interior of the structure to avoid the watery environment. Positively and negatively charged amino acids attract each other. Hydrogen bonds promote the formation of regular spirals or parallel sheets. Together, these forces drive an otherwise disordered chain to fold into an orderly structure. And this ordered structure, in turn, defines the behavior of the protein, enabling it to act as a catalyst, to bind to DNA, or to drive the contraction of muscles.
Determining the order of the amino acids in a protein's chain is relatively straightforward. It is defined by the order of the DNA bases within the gene that encodes the protein. And since we have become very good at sequencing entire genomes, we now have an overabundance of gene sequences and thus a huge surplus of protein sequences. For many of them, however, we have no idea what the folded protein looks like, which makes it difficult to determine how it works.
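The step from gene to amino acid sequence can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code and a clean coding sequence that starts in frame; the names `CODON_TABLE` and `translate` are hypothetical, and real genes (with introns, alternative codes, and so on) need more care.

```python
from itertools import product

# Standard genetic code in the conventional compact layout: codons are
# enumerated with bases in the order T, C, A, G, and "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon.
    Trailing bases that do not form a complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# ATG (Met), GCC (Ala), AAA (Lys), TGA (stop)
print(translate("ATGGCCAAATGA"))
```

Getting this string of letters is the easy part; what it does not tell you is where any of those amino acids sit in three-dimensional space.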
Since the backbone of a protein is very flexible, almost any two amino acids in a protein can potentially interact with each other. Figuring out which ones actually do interact in the folded protein, and how that minimizes the free energy of the final configuration, becomes an intractable computational challenge once the number of amino acids gets too large. In essence, if any amino acid could occupy any potential coordinates in 3D space, figuring out where to place each one becomes extremely difficult.
Despite the difficulties, there has been some progress, including through distributed computing and the gamification of folding. But an ongoing biennial event called the Critical Assessment of Protein Structure Prediction (CASP) has seen fairly irregular progress throughout its existence. And in the absence of a successful algorithm, humans are left with the tedious task of purifying the protein and then using X-ray diffraction or cryo-electron microscopy to work out the structure of the purified form. These efforts can often take years.
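A back-of-the-envelope calculation gives a feel for that scaling. The sketch below assumes, purely for illustration, that each amino acid can adopt just three local conformations; that number is an assumption, not a figure from DeepMind or CASP, and the function names are hypothetical.

```python
def count_residue_pairs(n_residues: int) -> int:
    """Number of distinct amino acid pairs that could potentially interact."""
    return n_residues * (n_residues - 1) // 2


def count_conformations(n_residues: int, states_per_residue: int = 3) -> int:
    """Number of chain conformations if every residue independently picks
    one of `states_per_residue` local shapes (an illustrative assumption)."""
    return states_per_residue ** n_residues


for n in (10, 50, 100, 300):
    print(
        f"{n:>4} residues: {count_residue_pairs(n):>6} pairs, "
        f"{count_conformations(n):.2e} conformations"
    )
```

The number of pairs grows only quadratically, but the number of conformations grows exponentially, which is why checking every possibility is hopeless for a protein of even modest length.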
DeepMind enters the fight
DeepMind is an AI company that was acquired by Google in 2014. Since then, it has made a number of splashes, developing systems that have successfully beaten humans at Go, chess, and even StarCraft. In several of its notable successes, the system was trained simply by being given the rules of a game before being set loose to play against itself.
While that approach is incredibly powerful, it wasn't clear that it would work for protein folding. For one thing, there is no obvious external standard for a "win": if you find a structure with very low free energy, that does not guarantee there isn't one slightly lower still out there. There aren't many rules either. Yes, amino acids with opposite charges lower the free energy when they sit next to each other. But that won't happen if it comes at the cost of dozens of hydrogen bonds, or of hydrophobic amino acids sticking out into water.
How do you adapt an AI to these conditions? For their new algorithm, called AlphaFold, the DeepMind team treated the protein as a spatial network graph, with each amino acid serving as a node and the connections between nodes determined by their proximity in the folded protein. The AI itself is then trained to work out the configuration and strength of these connections by feeding it the previously determined structures of over 170,000 proteins from a public database.
When AlphaFold is given a new protein, it looks for proteins with a related sequence and aligns the related parts of the sequences. It also searches for proteins with known structures that have similar regions. Typically, these approaches are great at optimizing local features of the structure but not so good at predicting the overall protein structure: mixing a number of highly optimized parts together does not necessarily make an optimal whole. This is where an attention-based part of the algorithm comes in, used to make sure the overall structure is coherent.
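To make the graph idea concrete, here is a minimal sketch of how such a graph can be derived from a structure that has already been solved. It is not DeepMind's code: the function name `contact_graph`, the one-point-per-amino-acid simplification, and the 8 angstrom cutoff are all assumptions chosen for illustration. The hard part, which the trained network has to do, is inferring these connections from the sequence alone.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=8.0):
    """Build an adjacency map from 3D coordinates.

    coords: one (x, y, z) point per amino acid, in angstroms (assumed).
    cutoff: distance below which two amino acids count as connected
            (8.0 is an illustrative choice, not a value from the article).
    Returns a dict mapping each residue index to the set of its neighbours.
    """
    graph = {index: set() for index in range(len(coords))}
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Toy example: three residues close together and one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
graph = contact_graph(toy_coords)
print(graph)
```

In a real folded protein, the interesting edges are the ones between amino acids that are far apart in the chain but end up close together in space.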
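DeepMind has not yet published the details of its attention component, so the following is only a generic sketch of scaled dot-product attention, the basic operation behind attention-based models in general. The function names and toy vectors are hypothetical. The point it illustrates is that every position's output is a weighted mix over all positions, which is how information from distant parts of a chain can influence each local decision.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention(queries, keys, values):
    """Generic scaled dot-product attention over lists of vectors.

    For each query, score it against every key, convert the scores to
    weights with softmax, and return the weighted average of the values.
    """
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(weight * value[d] for weight, value in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy example with three positions, each described by a 2D vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(features, features, features))
```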
A clear success, but with limits
For this year's CASP, AlphaFold and the other entrants' algorithms were set loose on a number of proteins whose structures had either not yet been solved (and were solved as the challenge went on) or had been solved but not yet published. As a result, there was no way for the algorithm developers to prime their systems with real-world information, and the output of the algorithms could be compared to the best real-world data as part of the challenge.
AlphaFold did quite well, far better than any other entry. For about two-thirds of the proteins it predicted a structure for, its result was within the experimental error you would get if you tried to replicate the structural studies in a laboratory. Overall, on an accuracy scale that runs from zero to 100, it averaged a score of 92, again the sort of range you would see if you tried to obtain the structure twice under two different conditions.
By any reasonable standard, the computational challenge of figuring out the structure of a protein has been solved.
Unfortunately, there are a lot of unreasonable proteins out there. Some get inserted into a membrane immediately; others quickly acquire chemical modifications. Still others require extensive interactions with specialized enzymes that burn energy to force other proteins to refold. AlphaFold is unlikely to handle all of these edge cases, and without a published paper describing the system, it will take a while, and some real-world use, to figure out its limitations. This is not meant to take anything away from an incredible achievement, only to warn against unreasonable expectations.
The crucial question now is how quickly the system will be made available to the biological research community so that its limits can be defined and we can start using it in cases where it is likely to work well and has significant value, such as the structures of proteins from pathogens or of the mutated forms found in cancer cells.
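CASP's zero-to-100 score is based on the global distance test (GDT), which asks what fraction of amino acids in the prediction land within a few distance cutoffs of their experimentally determined positions. The sketch below is a simplified version under stated assumptions: it takes one point per amino acid, assumes the two structures have already been superimposed, and uses the commonly quoted 1, 2, 4, and 8 angstrom cutoffs. The real measure also searches over superpositions, which is omitted here, and the function name `gdt_ts` is illustrative.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-style score on a 0 to 100 scale.

    predicted, experimental: equal-length lists of (x, y, z) points, one per
    amino acid, assumed to be already superimposed (a simplification).
    For each cutoff, compute the fraction of amino acids whose predicted
    position is within that distance of the experimental one, then average
    the fractions and scale to 100.
    """
    if len(predicted) != len(experimental):
        raise ValueError("structures must have the same number of residues")
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    fractions = [
        sum(distance <= cutoff for distance in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: four residues whose predictions are off by 0.5, 1.5, 3.0,
# and 10.0 angstroms along the x axis.
experimental = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (6.5, 0.0, 0.0), (13.0, 0.0, 0.0), (25.0, 0.0, 0.0)]
print(gdt_ts(predicted, experimental))  # 56.25
```

A perfect prediction scores 100 under this scheme, and a score in the low 90s means nearly every amino acid sits within an angstrom or two of where experiment puts it.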