Description

The estimation of the "Tree of Life" from biomolecular sequence data presents many interesting computational, mathematical, and statistical challenges. In particular, the most accurate methods depend on an accurate multiple sequence alignment, and estimating such an alignment is itself a challenging problem. SATe (Liu et al., Science 2009 and Systematic Biology 2012) provides accurate large-scale multiple sequence alignments, but only for datasets of up to approximately 30,000 sequences.

In this talk, I will present two new methods for ultra-large alignment. The first is UPP (Ultra-large alignment estimation using SEPP, unpublished). Its basic technique is a tree-based decomposition of a single HMM (Hidden Markov Model) into a family of HMMs that together represent a multiple sequence alignment. UPP can estimate highly accurate alignments of up to 1,000,000 sequences very efficiently; other applications of this "HMM Families" technique improve phylogenetic placement and taxonomic classification of short reads. The second method is PASTA (Practical Alignments using SATe and Transitivity). PASTA produces somewhat more accurate alignments and trees than UPP but is not quite as scalable. Finally, I will present DACTAL (Divide-And-Conquer Trees (ALmost) without alignments, RECOMB 2012 and Bioinformatics 2012), which can estimate a tree without a full multiple sequence alignment.