Abstract

Petabytes of valuable sequencing data reside in public repositories, doubling in size every two years. These data contain a wealth of genetic information about viruses that could help us monitor spillovers and anticipate future pandemics. We recently developed a cloud-based bioinformatics infrastructure, named Serratus, to perform petabase-scale sequence alignment. With it, we analyzed all available RNA-seq samples (5.7 million samples, 10 petabytes) and discovered ten times more RNA viruses than were previously known, including a new family of coronaviruses (Edgar et al., Nature, 2022). In this talk, I will present the computational infrastructure and some of the biological analyses.

Video Recording