From PDB to AlphaFold

The following timeline is a draft of a collective history of, and reflection on, the datasets (especially the PDB) and scientific advances that gave rise to AlphaFold.

Contributions are welcome, for example detailed accounts and verifications from the individuals involved, interviews with them, identifications of relevant documents and reflections about the future of AI in science that are inspired by this history.

 

1958-1960 The first protein structures, myoglobin and hemoglobin, were determined by John Kendrew and Max Perutz at Cambridge, UK; both were solved using X-ray crystallography.

 

1962 John Kendrew and Max Perutz received the Nobel Prize in Chemistry for their discoveries.

 

1969 Cyrus Levinthal described the paradox of protein folding: the folding process must be guided by specific interactions and not by a random search through all possible conformations, which would take an immensely long time.
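
A rough back-of-the-envelope version of this argument can be sketched in Python. The numbers are illustrative assumptions, not values from this timeline: a chain of 100 residues, 3 possible conformations per residue, and 10^13 conformations sampled per second. The function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_search_years(n_residues=100, states_per_residue=3,
                           samples_per_second=1e13):
    """Estimate the time (in years) for an exhaustive random search.

    All default values are illustrative assumptions.
    """
    # Each residue is assumed to choose its conformation independently,
    # so the number of chain conformations grows exponentially.
    conformations = states_per_residue ** n_residues
    seconds = conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = levinthal_search_years()
    print(f"Exhaustive search would take about {years:.1e} years")
```

Under these assumptions the search time exceeds the age of the universe (roughly 1.4 x 10^10 years) by many orders of magnitude, while real proteins typically fold in seconds or less, which is the essence of the paradox.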

 

1970-1971 As described by Helen M. Berman (Berman, 2008):

"The establishment of the Protein Data Bank (PDB) began in the 1970's as a grassroots effort. A group of (then) young crystallographers, including Edgar Meyer, Gerson Cohen and myself, began discussing the idea of establishing a central repository for coordinate data at an American Crystallographic Association (ACA) meeting in Ottawa, Canada, in 1970. Those conversations were continued with a larger group at the ACA meeting in Columbia, South Carolina, USA, in 1971. At that meeting, a petition was written, and a proposal was submitted to the United States National Committee for Crystallography (USNCCr)."

In 1970 Meyer wrote to Helen Berman that he "initially thought about approaching the International Union of Crystallography (IUCr), but became discouraged when told he would run into the opposition of “certain blocking groups.”" (Strasser, 2019)

 

1971 June. The meeting at Cold Spring Harbor.

As described by Bruno J. Strasser (Strasser, 2019):

"Any solution to this problem would require a broad international consensus. Fortunately, a unique opportunity soon arose to discuss the data bank project with the international crystallographic community. In June 1971, the Cold Spring Harbor Symposium on Quantitative Biology was devoted to the “Structure and Function of Proteins at the Three-Dimensional Level.” Organized by James Watson, the list of attendees of this select meeting read like a “who’s who” in protein crystallography, including (future) Nobel Prize winners Dorothy Crowfoot Hodgkin, Max Perutz, Aaron Klug, and William N. Lipscomb. Although the meeting was by invitation only, a few scientists who were too junior to be on the list decided to participate anyway and “kind of crashed the meeting.” Helen Berman and three friends, self-described “hippies” who valued communitarian ideals, drove from Philadelphia to Long Island to attend the meeting and present the idea for a crystallographic data bank." [figure 4.4 in the Strasser book shows Sung-Hou Kim, Helen M. Berman, Joel L. Sussman, and Nadrian C. Seeman, in front of MIT, on their way (uninvited and mostly unregistered) to the Cold Spring Harbor Symposium on Quantitative Biology]

From Helen M. Berman (Berman, 2008):

"The discussions within the meeting room, on the lawn, and on the beach were exciting and intense. In an informal meeting convened by Max Perutz, protein crystallographers discussed how best to collect and distribute data.

During the CSH meeting, [Walter] Hamilton was approached with the idea that had been discussed within the ACA community – a public data bank of protein structures. At an ad hoc meeting of protein crystallographers attending the Symposium, it was proposed that there should be a repository with identical files in the United Kingdom and in the USA. Hamilton volunteered to set up the American data bank at Brookhaven.

When Max Perutz returned to England, he discussed this proposal with Olga Kennard, who was the founder of the Cambridge Crystallographic Data Centre (CCDC) and had wide experience in assembling and archiving crystallographic data. Walter Hamilton wrote to her with an offer of collaboration and proposed to meet and discuss some of the details of coordinating the activities. He visited England that summer and, by October 1971, the establishment of the Protein Data Bank archive, jointly operated by the CCDC and BNL, was announced in Nature New Biology" (1971, Nature New Biology).

At the time Walter Hamilton was Deputy Chairman of the Chemistry Department at Brookhaven National Laboratory. He was also a former President of the American Crystallographic Association.

 

1971-1976 The PDB was established at Brookhaven National Laboratory under the leadership of Walter Hamilton. It originally contained 7 structures and grew slowly at first: by 1976 the database held a total of 13 structures.

 

1972 Christian Anfinsen received the Nobel Prize in Chemistry for his work showing that all the information for a 3D protein structure is contained in the sequence of amino acids.

 

1982 The sequence databases at GenBank in the US (Jordan, 1982) and at EMBL in Europe (Hamm, 1986) were opened to the public (Strasser, 2008; Strasser, 2019). The databases at GenBank, EMBL and after 1986 at DDBJ in Japan are mirror organizations with the same content.

 

1989 An article by Marcia Barinaga in Science about "The Missing Crystallography Data" provides a very informative snapshot of the ongoing discussions (Barinaga, 1989). The article mentions a letter initiated by Frederic Richards (Yale) in 1987 and co-signed by 173 colleagues, encouraging the sharing of protein structure data. Among the leading petition signatories in addition to Richards were Jane and David Richardson. A second letter by Richard Dickerson (UCLA) in 1989 made the point again and presented data showing that less than half of published DNA structures disclosed the coordinates. Dickerson stated that "we are on our way to developing a miniature scandal." There was a difference of opinion among Editors of the major scientific journals, including Science and Nature, regarding the necessity of sharing all the details of the structures at the time of publication. Industry opinions also differed. The deputy director of NIGMS, Marvin Cassman, encouraged public deposition of data and hoped that the scientific community would come to an agreement about this.

The International Union of Crystallography published guidelines recommending deposition of data, but as a compromise among different viewpoints the release of the coordinates could be delayed for up to 1 year after publication, and the release of the X-ray data could be delayed for up to 4 years.

An editorial by John Maddox, the Editor of Nature, was published in September 1989, soon after the Barinaga article, and defended the policy of not requesting a database deposition as a condition of publication of structural biology and DNA sequencing papers (Maddox, 1989). In November 1989 letters to Nature by scientists condemning this policy followed. Richard J. Roberts (at CSHL, later awarded the 1993 Nobel Prize in Physiology or Medicine) stated that he was "appalled by the comments of John Maddox" (Roberts, 1989). Thomas Koetzle, at the time Director of PDB, more diplomatically encouraged Nature to "reconsider its policy of not requiring deposition of data in the appropriate databases" (Koetzle, 1989).

In 1989 Renato Dulbecco (Nobel Medicine 1975) published remarks about the world of science moving away from open communication and sharing, which he confirmed in his interview shown on this site, as did many other sources. What was eventually achieved by the PDB is even more remarkable because it worked against this broader trend.

 

1994 CASP (Critical Assessment of protein Structure Prediction) was co-founded by John Moult and Krzysztof Fidelis, as a blind and independent test of software for the prediction of protein structure from sequence.

Rosetta (Rohl, 2004), contributed by the lab of David Baker, was one of the most successful methods in the initial phase.

The results improved until 2002, but after that date they were essentially flat. The next major improvements were in 2018 and 2020 with AlphaFold and AlphaFold2.

 

 

 

Figure 1: The scores of the winner of the CASP competition, which has taken place every 2 years since 1994. Note the long period of stasis. The last two time points are AlphaFold in 2018 and AlphaFold2 in 2020. GDT (Global Distance Test) is the main metric used to evaluate predictions submitted to CASP. The figure is from a DeepMind video about the making of AlphaFold (minute 7:27).
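
As a rough illustration of what the GDT score in Figure 1 measures: the GDT_TS variant is commonly described as the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose predicted position lies within that cutoff of the experimental position. The Python sketch below is a simplification, not the official CASP evaluation: it assumes the two structures are already superimposed and are given as lists of (x, y, z) coordinates, one per residue, whereas the full procedure also searches over superpositions. The function name and the toy coordinates are hypothetical.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for two pre-superimposed structures."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("structures must be non-empty and of equal length")
    # Distance between each predicted residue and its reference position.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Fraction of residues within each cutoff, then averaged over cutoffs.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Toy example: four residues displaced by 0.5, 1.5, 3.0 and 10.0 angstroms.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    predicted = [(0.5, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0), (13.0, 0.0, 0.0)]
    print(gdt_ts(predicted, reference))
```

A perfect prediction scores 100, and a prediction in which no residue falls within 8 Å of its reference position scores 0.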

1998 Nature, Science and PNAS adopt policies requesting the immediate release of high-resolution structural coordinate data upon publication.

Nature stated that "It is clear that there is a significant majority opinion in the community against permitting a one-year hold. Accordingly, Nature, simultaneously with Science, is changing its policy. Any paper containing new structural data received on or after 1 October 1998 will not be accepted without an accession number from the Brookhaven Protein Data Bank (PDB) accompanied by an assurance that unrestricted (“layer-1”) release will occur at or before the time of publication." (Campbell, 1998)

 

1999 The Research Collaboratory for Structural Bioinformatics (RCSB) became the new manager of the PDB (Berman, 2000). The three member institutions of the RCSB were: Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology.

The new director was Helen Berman of Rutgers University. The San Diego Supercomputer Center site at UC San Diego was led by Philip Bourne.

 

2000 Larry Page, who co-founded Google in 1998, predicts in an interview (minute 3:59) the importance of AI for providing answers to search inquiries and for the future of Google.

 

2003 The worldwide PDB is announced (Berman, 2003) as a collaboration of three organizations: The RCSB, the Macromolecular Structure Database at the European Bioinformatics Institute (EBI) and the Protein Data Bank Japan (PDBj) at the Institute for Protein Research in Osaka University. The goal is "maintaining a single archive of macromolecular structural data that is freely and publicly available to the global community". 

 

2008 The first significant use of GPUs (graphics processing units) in machine learning applications. An account was presented by Rajat Raina and Andrew Ng at a NIPS workshop. GPUs were initially developed for digital image processing and used by the videogame industry but were later found to be able to considerably speed up calculations needed in AI applications.

 

2009  Initial publication of ImageNet, a very large and systematic dataset of labelled images built by a group coordinated by Fei-Fei Li and designed to support AI vision research.

According to Fei-Fei Li “One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research. People really recognize the importance the dataset is front and center in the research as much as algorithms.”

 

2010 14-15 August. Demis Hassabis presented at the Singularity Summit in San Francisco, one of a series of yearly conferences about artificial intelligence, initially supported by Peter Thiel. The singularity is the moment when artificial intelligence becomes more capable than human intelligence. The title of Hassabis's talk was “A Systems Neuroscience Approach to Building AGI” and the first slide already showed the DeepMind logo. He suggested that machine learning and knowledge of neuroscience could be combined to design artificial general intelligence. Several slides showed the work of Tomaso Poggio, a leading AI scientist. Hassabis had been a visiting scientist in the lab of Poggio at MIT.

Shane Legg also presented and spoke about “Universal measures of intelligence”.

In November DeepMind was officially founded by Hassabis, Shane Legg and Mustafa Suleyman.

 

2010 to early 2011 Funding of DeepMind by venture capital groups, led by Peter Thiel and his Founders Fund. Tomaso Poggio was also a minor investor (DeepMind 2011 Annual return). According to a 2024 interview with Shane Legg (minute 7:06), in 2010 people thought that AI was a failed area, and nobody wanted to fund it. This was especially true for DeepMind, because they did not just propose to do some machine learning: they wanted to build artificial general intelligence. Peter Thiel funded them because he is a well-known contrarian investor. He obtained opinions from other people who probably told him that investing in this company was a bad idea. One of the first breakthroughs was an algorithm that could play many different Atari videogames. This was the first general algorithm.

 

2012 AlexNet, a convolutional neural network (CNN), designed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton from the University of Toronto, won the ImageNet Large Scale Visual Recognition Challenge. It was the first model based on neural networks to win the competition, achieving a large improvement compared to previous methods. The three authors were hired by Google in 2013.

It has been widely commented that the ImageNet 2012 competition triggered the big explosion of interest in AI.

 

2014 January. Google bought DeepMind for around $600m but DeepMind remained as a separate entity. DeepMind obtained access to a large computational infrastructure and capital for expanding and acquiring top talent for their team.

In 2023 DeepMind merged with Google Brain and Hassabis was put in charge of the entire Google AI effort, possibly as a response to the success of ChatGPT.

 

2014 The total number of structures deposited in PDB surpasses 100,000.

 

2015 Emails from this period between Elon Musk, Sam Altman and others were recently published as part of a court case.

They show how a concern about Google and DeepMind dominating AI was one of the motivations for starting OpenAI, the developer of ChatGPT. Musk and Altman wrote that "OpenAI is a non-profit artificial intelligence research company with the goal of advancing digital intelligence in the way that is most likely to benefit humanity as a whole, unencumbered by an obligation to generate financial returns." An email by Altman stated "OpenAI's mission is to ensure that artificial general intelligence (AGI) - by which we mean highly autonomous systems that outperform humans at most economically valuable creative work - benefits all of humanity."

We now know that for-profit motives have since become more central at OpenAI, and a reflection about the appropriate governance structure for non-profit AI might be needed.

 

2016 The AlphaGo match was another proof of principle for DeepMind. After the AlphaGo match Hassabis remembered playing Foldit, a game designed by David Baker and others to allow the general public to participate in protein folding efforts, and other discussions about this problem going back to his college days. Foldit showed the potential of crowdsourcing in science. (Cooper, 2010)

DeepMind started a serious effort on the protein folding problem.

 

2017 Publication of the "Attention Is All You Need" paper about transformers by a group from Google (Vaswani, 2017).

 

2017 Recently launched validation efforts at PDB are described (Gore, 2017). PDB produces a validation report that is required for review by an increasing number of journals. The report provides metrics to evaluate the quality of the experimental data, the structural model, and the fit between them.

 

2018 AlphaFold from DeepMind wins CASP13 (Senior, 2020).

 

2020 AlphaFold2 wins CASP14 and it is considered by many to have essentially solved the protein folding problem (Jumper, 2021). The authors stated that "This bioinformatics approach has benefited greatly from the steady growth of experimental protein structures deposited in the Protein Data Bank (PDB), the explosion of genomic sequencing and the rapid development of deep learning techniques to interpret these correlations." (Jumper, 2021)

Various parts of the model used copies of PDB obtained at different times from 2019 to 2020.  In 2019 PDB contained 158,794 structures and in 2020 contained 172,779 structures.

2,423,213,294 protein sequences were obtained from UniProt and other open resources and were used for multiple sequence alignments providing evolutionary information. The majority of the proteomes in UniProt are based on the translation of genome sequences from GenBank and the other mirror sites in Europe and Japan. (UniProt Consortium, 2021)

AlphaFold2 is based on a modified transformer architecture. It uses comparative evolutionary information in the Evoformer, and then passes information to another transformer called the structure module. The information cycles between the two modules. The DeepMind team working on AlphaFold was led by John Jumper and supervised by Demis Hassabis.

 

2021 RoseTTAFold (Baek, 2021), developed by a team led by David Baker, incorporated ideas from AlphaFold2 and achieved accuracies approaching it.

 

2022 DeepMind releases structure predictions for 218 million proteins, nearly all known proteins.

 

2024 The lab of David Baker releases RoseTTAFold All-Atom, which predicts 3D structures of assemblies of proteins with nucleic acids, small molecules and other components. (Krishna, 2024)

OpenFold (Ahdritz, 2024), an open-source implementation of AlphaFold2 including the code and data required to train new models, is produced by a large academic collaboration and yields insights into its learning mechanisms and capacity for generalization.

DeepMind releases AlphaFold3, which adds a diffusion-based method to predict binding structures and interactions of proteins with other molecules. (Abramson, 2024)

 

2023 - 2024 AlphaMissense is another AI tool developed by DeepMind. It was published in Science in September 2023 and predicts the pathogenicity of all possible human single amino acid substitutions (Cheng, 2023). All the components of the AlphaFold and AlphaFold2 AI models were shared openly, but in the case of AlphaMissense the trained weights, a set of parameters essential for running the model, were not shared. When AlphaFold3 was published in Nature in May 2024 the code was not provided (Abramson, 2024). A server was offered for non-commercial use, but the number and types of queries allowed were limited.

A petition signed by more than one thousand scientists expressed disappointment with the lack of disclosure of the AlphaFold3 code at the time of publication in Nature. Not even reviewers were given access to the code during review.

Six months after publication the code of AlphaFold3 was released for academic use.

Opinions about transparency and AI from several journal editors (FEBS Journal, JBC and Science) are also shared on this site.

 

2024 Oct. The Nobel Prize in Chemistry for David Baker, Demis Hassabis and John Jumper was announced. Part of the Nobel Prize in Physics was awarded to Geoffrey Hinton.

 

2024 Nov 18.  AI for Science Forum, co-hosted by Google DeepMind and the Royal Society.

In one of the sessions, Janet Thornton (minute 19:25), Director Emeritus, European Molecular Biology Laboratory - European Bioinformatics Institute, which was closely involved in the PDB, said that it took 20 years for every scientist to come around to the idea of sharing the data. A key step was a statement from the International Union of Crystallography saying that unless people deposited their data, they would not be able to publish in various journals. Many in the community were already on board, with some outstanding exceptions. Some of the most famous scientists did not initially share. A change of culture was needed. In academic research the data are obtained with public funds, so the case for sharing is even stronger.

Siddhartha Mukherjee (Columbia University) (minute 18:28) pointed out that patients might freely share their data to benefit the public good but might be less inclined to agree to do it for the benefit of a company. The same might be said of scientists.

Anna Greka (Core Institute Member, Broad Institute of MIT and Harvard) (minute 32:17) suggested that a dataset that could play the same role as the PDB for future AI models of the cell could be obtained by systematically perturbing human cells. It would need to be a well-controlled and clean dataset including single cell data, transcriptomics and imaging measurements.

 

Most Interviews on this site, especially the recent ones, also mention datasets that could support AI models, as the PDB did. For example, Aviv Regev and Sarah Teichmann described the Human Cell Atlas, a consortium that aims to create a comprehensive reference map of all human cells; Gene Yeo mentioned the potential importance of collecting data about all possible RNA modifications;  Jack Gilbert described the complexities and opportunities arising from microbiome data; Gary Siuzdak and Bruno Conti spoke about metabolomics and metabolomics databases.

In the Interviews there were also several mentions of virtual models at the cell or higher level that would need integrations of many datasets.

 

In another session of the AI for Science Forum, Paul Nurse (Director of the Francis Crick Institute and 2001 Nobel Prize in Physiology or Medicine) made several thought-provoking remarks (minutes 1:13 and 21:34). He said:

Science has increased in complexity and silos have been created, quite often self-referential silos. We must begin by seeing how we can break down those silos, how we can actually get the different parts of the scientific community talking to each other, not just collaborating, but interacting and talking one with another.

That's particularly the case with respect to artificial intelligence, because we are all being influenced by it. We must increase the permeability to it, so it doesn't become a sort of new priesthood that is somehow separate from the rest of the scientific endeavor. There are social science aspects to this, of actually getting a better working community, working across disciplines, working across different types of organizations, from universities through to industrial and commercial organizations. And that requires, not only the will, including the political will, but actually us thinking carefully about how we talk to each other, how we think about it, how people are trained in different ways in different scientific fields. We would benefit by using social scientists to help us. We need advice and help from them.

 

2024 Dec 8.  Nobel lectures by Baker, Hassabis and Jumper. All the speakers said that the PDB data had been essential for their work.

 

 

 

 

 

 

 

 

Figure 2: A list of criteria suggested by Demis Hassabis to determine if a scientific problem is suitable for an AI solution.

1- Massive combinatorial search space

2- Clear objective function (metric) to optimize against

3- Either lots of data and/or an accurate and efficient simulator

The slide was presented during his Nobel lecture.

REFERENCES

(see also links within the text)

 

1971. Crystallography: Protein Data Bank. Nature New Biology, 233(42), pp.223-223.

 

Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A.J., Bambrick, J. and Bodenstein, S.W. et al, 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), p.493.

 

Ahdritz, G., Bouatta, N., Floristean, C., Kadyan, S., Xia, Q., Gerecke, W., O’Donnell, T.J., Berenberg, D., Fisk, I., Zanichelli, N. and Zhang, B. et al, 2024. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nature Methods, pp.1-11.

 

Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G.R., Wang, J., Cong, Q., Kinch, L.N., Schaeffer, R.D. and Millán, C. et al, 2021. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557), pp.871-876.

 

Barinaga, M., 1989. The missing crystallography data: some disgruntled researchers are mounting a campaign to force crystallographers to make available key data when they publish the structure of complex molecules. Science, 245(4923), pp.1179-1181.

 

Berman, H.M., 2008. The protein data bank: a historical perspective. Acta Crystallographica Section A: Foundations of Crystallography, 64(1), pp.88-95.

Berman, H., Henrick, K. and Nakamura, H., 2003. Announcing the worldwide protein data bank. Nature structural & molecular biology, 10(12), pp.980-980.

 

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E., 2000. The protein data bank. Nucleic acids research, 28(1), pp.235-242.

 

Campbell, P., 1998. New policy for structure data. Nature, 394(6689).

 

Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytė, A., Applebaum, T., Pritzel, A., Wong, L.H., Zielinski, M., Sargeant, T. and Schneider, R.G., 2023. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381(6664), p.eadg7492.

 

Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., Popović, Z. and Foldit players, 2010. Predicting protein structures with a multiplayer online game. Nature, 466(7307), pp.756-760.

 

Gore, S., García, E.S., Hendrickx, P.M., Gutmanas, A., Westbrook, J.D., Yang, H., Feng, Z., Baskaran, K., Berrisford, J.M., Hudson, B.P., Ikegawa, Y. et al, 2017. Validation of structures in the Protein Data Bank. Structure, 25(12), pp.1916-1927.

 

Hamm, G.H. and Cameron, G.N., 1986. The EMBL data library. Nucleic acids research, 14(1), pp.5-9.

 

Jordan, E. and Carrico, C., 1982. DNA database. Science, 218(4568), pp.108-108.

 

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., et al, 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), pp.583-589.

 

Koetzle, T.F., 1989. Benefits of databases. Nature, 342(6246), pp.114-114.

 

Krishna, R., Wang, J., Ahern, W., Sturmfels, P., Venkatesh, P., Kalvet, I., Lee, G.R., Morey-Burrows, F.S., Anishchenko, I., Humphreys, I.R. and McHugh, R. et al, 2024. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science, 384(6693), p.eadl2528.

 

Maddox, J., 1989. Making good databanks better. Nature, 341(6240), pp.277-277.

 

Roberts, R.J., 1989. Benefits of databases. Nature, 342(6246), pp.114-114.

 

Rohl, C.A., Strauss, C.E., Misura, K.M. and Baker, D., 2004. Protein structure prediction using Rosetta. In Methods in enzymology (Vol. 383, pp. 66-93). Academic Press.

 

Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A.W., Bridgland, A. and Penedones, H., et al, 2020. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), pp.706-710.

 

Strasser, B.J., 2008. GenBank-Natural History in the 21st Century?. Science, 322(5901), pp.537-538.

 

Strasser, B.J., 2019. Collecting experiments: Making big data biology. University of Chicago Press.

 

UniProt Consortium, 2021. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research, 49(D1), p.D480.

 

Vaswani, A. et al, 2017. Attention is all you need. Advances in Neural Information Processing Systems.

arxiv.org/abs/1706.03762
