Abstract
We discuss a possible direction for the development and application of the MAFFT multiple sequence alignment (MSA) program. The application we are considering is the prediction of protein-protein interactions. This area has been well studied, and recently AlphaFold and AlphaFold-Multimer achieved highly accurate predictions using MSAs as features. However, a well-known weakness of these methods is the prediction of antigen-antibody interactions. Our hypothesis is that one reason for this difficulty is that the sequences in an MSA of antibodies have evolutionary relationships that differ greatly from those of other proteins. Antibodies share a common ancestry and typically a common framework region. However, they exhibit high diversity in the complementarity-determining regions (CDRs), generated by recombination and hypermutation in individual cells within a single organism. Thus, if the sequences are collected by a standard database search, an MSA of antibodies is a mixture of homologous sequences that bind different binding sites (epitopes) on various antigens. Such an MSA can be noisy for predicting the interaction between a given antibody and its antigen. One possible solution is to use the InterClone database, which is being developed in our lab. InterClone clusters antibodies by their CDR sequences and can be used to prepare a set of antibody sequences that are more likely to share an antigen and epitope than sequences gathered by a standard sequence similarity search. We also plan to experimentally determine the positions of epitopes in several protein antigens, both to obtain an informative set of antibody-antigen pairs and to expand the training data for AlphaFold and other structure prediction methods.