Abstract

Genome sequencing and assembly have exploded since 2015. Today, many lineages contain closely related species, as well as species with multiple diverse genome sequences. Having more genomes seems like a good thing for studying ecology and evolution across the tree of life. However, the workhorse algorithm for genomic studies, sequence alignment, is breaking down in terms of both computational efficiency and accuracy. We explore these issues using metagenomic applications in which microbial communities are sequenced as a pool and alignment is used to map reads to the correct species and genomic site before downstream bioinformatics applications such as abundance estimation and genotyping. We quantify alignment errors and computational barriers across a broad range of scenarios, including lineages in which a commonly used, operational definition of the species boundary (>95% average nucleotide identity) is blurred. Then, we propose several actionable and aspirational solutions to problems such as genome redundancy, reference bias, and cross-mapping. This work demonstrates that efficient algorithms and data structures are essential to maintain access to genomic and metagenomic data science for researchers without massive high-performance computing resources and to ensure read mapping is accurate on a densely sequenced tree of life.
