Comment from Martin Steinegger

Martin Steinegger is an Assistant Professor in the biology department at Seoul National University. He is a co-author of the 2021 paper describing AlphaFold2: Jumper et al., Nature, 596(7873), pp. 583-589.

He sent the following comment:

These are my thoughts on the milestones for MSAs required for the success of AlphaFold2.

The term “MSA” might be misinterpreted in this context. In my text, I’m referring to query-centered MSAs generated through homology searches, not the global alignments that we can obtain from progressive aligners like ClustalW once we have more data.

1965: Margaret Oakley Dayhoff's Atlas of Protein Sequence and Structure

Margaret O. Dayhoff pioneered the systematic collection and analysis of protein sequences with the publication of the Atlas of Protein Sequence and Structure. This work compiled all ~70 protein sequences known at the time. Dayhoff's efforts laid the groundwork for bioinformatics and for the development of substitution matrices, such as the PAM (Point Accepted Mutation) matrices, which are still used to score sequence alignments today.

Dayhoff, M.O. et al. (1965). Atlas of Protein Sequence and Structure, National Biomedical Research Foundation.

1970–1982: Needleman–Wunsch, Smith–Waterman to Gotoh

In 1970, Saul Needleman and Christian Wunsch introduced the Needleman–Wunsch algorithm, the first systematic method for global sequence alignment using dynamic programming. This method was followed by the Smith–Waterman algorithm in 1981, which provided a framework for local alignments to detect conserved regions. In 1982, Osamu Gotoh refined these methods by devising an elegant approach to compute affine gap penalties (opening a gap costs more than extending it), thereby enabling rapid and biologically accurate sequence alignments. Together, these seminal works produced the algorithm that is now executed billions of times daily to compute pairwise protein alignments.

Needleman, S.B., & Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology.

Smith, T.F., & Waterman, M.S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology.

Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology.
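
To make the combination of these three ideas concrete, here is a minimal sketch of a local (Smith–Waterman-style) alignment score computed with Gotoh-style affine gap handling. The scoring values (match, mismatch, gap_open, gap_extend) and the function name are illustrative assumptions, not parameters from the cited papers; real protein aligners use substitution matrices such as PAM or BLOSUM rather than a simple match/mismatch score.

```python
def local_affine_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of strings a and b with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All scoring values are illustrative assumptions.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    # E[i][j]: best score ending at (i, j) with a gap that consumes b[j-1]
    # F[i][j]: best score ending at (i, j) with a gap that consumes a[i-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from H or extend an existing one.
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Keeping separate E and F tables is what lets each cell be filled in constant time, so the whole computation stays proportional to the product of the two sequence lengths. Dropping the 0 in the recurrence and initializing the borders with gap costs would turn this into a global, Needleman–Wunsch-style alignment.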

1986–2002: Swiss-Prot, TrEMBL, and UniProt

In 1986, Amos Bairoch established Swiss-Prot, a database of curated proteins. Later, to handle the exponential growth in protein sequence data, TrEMBL was introduced in 1996 as a complementary database containing computationally annotated entries. In 2002, the UniProt Consortium was formed, bringing Swiss-Prot and TrEMBL together as the Universal Protein Resource (UniProt). The extensive, openly available protein sequence data in UniProt was indispensable for generating the diverse multiple sequence alignments critical for training AlphaFold2.

1990: BLAST (Basic Local Alignment Search Tool)

Altschul et al. introduced BLAST, a tool that revolutionized sequence searches by enabling rapid detection of sequence similarities. Its seed-and-extend alignment scheme dramatically improved researchers' ability to search ever-growing protein databases. Additionally, the introduction of the E-value, derived from the Karlin–Altschul statistical framework, provided a robust measure of how many matches with at least a given score are expected to occur by chance, thereby grounding sequence alignment in solid statistical principles.

Altschul, S.F. et al. (1990). Basic local alignment search tool. Journal of Molecular Biology.

Karlin, S., & Altschul, S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences.
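
As a rough illustration of the E-value mentioned above, the Karlin–Altschul formula gives the expected number of chance hits with raw score at least S as E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below assumes illustrative values for lambda and K (they depend on the scoring matrix and gap costs), and it ignores the effective-length corrections that real BLAST implementations apply; the function names are hypothetical.

```python
import math

# Illustrative assumptions only: lambda and K depend on the scoring system.
ILLUSTRATIVE_LAMBDA = 0.267
ILLUSTRATIVE_K = 0.041


def bit_score(raw_score, lam=ILLUSTRATIVE_LAMBDA, k=ILLUSTRATIVE_K):
    """Normalize a raw alignment score so it is comparable across scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=ILLUSTRATIVE_LAMBDA, k=ILLUSTRATIVE_K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Same quantity expressed through the bit score: E = m * n * 2^(-bits)."""
    return query_len * db_len * 2.0 ** (-bits)
```

Two properties follow directly from the formula: the E-value falls exponentially as the score rises, and it grows linearly with database size, so the same alignment becomes less significant as databases grow.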

1997: PSI-BLAST (Position-Specific Iterated BLAST)

Building upon the BLAST framework, PSI-BLAST constructs sequence profiles from initial alignments and uses them to search the database iteratively. This allows researchers to detect more remote homologous relationships efficiently.

Altschul, S.F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research.
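
The idea of a sequence profile can be sketched as a position-specific scoring matrix built from the columns of an alignment. The example below is a deliberate simplification under stated assumptions: it uses a uniform background distribution and a fixed pseudocount, whereas PSI-BLAST itself uses sequence weighting and more refined pseudocount schemes. The names build_pssm and score_with_pssm are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds profile from equal-length aligned sequences.

    Assumes a uniform background frequency (a simplification) and skips
    gap characters or any symbol outside the 20 standard amino acids.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive if the residue is enriched at this position, negative if depleted.
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_with_pssm(pssm, seq):
    """Ungapped score of seq against the profile, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, seq))
```

In an iterative search, the sequences found with one profile would be added to the alignment, the profile rebuilt, and the database searched again, which is what lets conserved positions dominate the score in later rounds.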

1998: HMMER, fast alignment of profile hidden Markov models to sequences

Sean Eddy developed HMMER, an efficient suite of methods that applies hidden Markov models to sequence search and alignment. By incorporating probabilities for insertions and deletions into the profile scoring, HMMER significantly improved the sensitivity of sequence comparisons, establishing itself as a critical tool for the large-scale annotation of protein families and domains.

Eddy, S.R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755–763.

Eddy, S.R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology.

2005–2012: HH-suite, fast HMM–HMM alignment

Johannes Söding and colleagues developed HHsearch and HHblits, which compare hidden Markov models against hidden Markov models (HMM–HMM). This method greatly enhances sensitivity for detecting remote homology, enabling the discovery of extremely distant relationships that might be missed by traditional approaches.

Söding, J. (2005). Protein homology detection by HMM–HMM comparison. Bioinformatics.

Remmert, M. et al. (2012). HHblits: lightning-fast iterative protein sequence searching by HMM–HMM alignment. Nature Methods.

2016–2019: MMseqs2 and Linclust, clustering proteins in linear time

Efforts to cluster vast amounts of sequence data led to the development of tools like MMseqs2 (Many-against-Many sequence searching) and Linclust. These methods enable ultra-fast clustering and detection of distant homologs in large-scale sequence datasets, facilitating the building of large-scale reference databases and comprehensive multiple sequence alignments for downstream analyses. Additionally, the Uniclust resource was established to provide deeply clustered and annotated protein sequence databases based on UniProt data; these databases were used for AlphaFold2 training.

Hauser, M., Steinegger, M., & Söding, J. (2016). MMseqs: software suite for fast and deep clustering and searching of large sequence sets. Bioinformatics.

Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology.

Mirdita, M. et al. (2017). Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research.

2017: Metagenomic data integration for structure prediction

The integration of metagenomic sequencing data dramatically expanded the pool of available protein sequences by adding billions of sequences from diverse microbial communities. This expansion, driven by a global effort to sequence samples and deposit the resulting data, has vastly improved the breadth and accuracy of multiple sequence alignments used in protein structure prediction and other analyses.

Ovchinnikov, S. et al. (2017). Protein structure determination using metagenome sequence data. Science.

2021: Making AlphaFold2 accessible to all through ColabFold

ColabFold made AlphaFold2 predictions widely accessible to researchers and practitioners without access to large-scale computing infrastructure. It provides high-quality, rapid, and free-of-charge multiple sequence alignment (MSA) generation through a publicly accessible MMseqs2 search server, together with a user-friendly Google Colab-based notebook interface.

Mirdita, M. et al. (2022). ColabFold: making protein folding accessible to all. Nature Methods.