Calvin Lab 116
Information Theory for High Throughput Sequencing
Extraordinary advances in sequencing technology in the past decade have revolutionized biology and medicine. Many high-throughput sequencing-based assays have been designed to make various biological measurements of interest. A key computational problem is that of assembly: how to reconstruct from the many millions of short reads the underlying biological sequence of interest, be it a DNA sequence or a set of RNA transcripts? Traditionally, assembler design is viewed mainly as a software engineering project, where time and memory requirements are the primary concerns, while the assembly algorithms themselves are designed based on heuristic considerations with no optimality guarantee. In this talk, we outline an alternative approach to assembly design based on information-theoretic principles. Starting with the question of when there is enough information in the reads to reconstruct, we design near-optimal assembly algorithms that can reconstruct with a minimal amount of read information. We illustrate our approach in two settings: DNA sequencing and RNA sequencing. We report preliminary results from ShannonDNA, a DNA assembler, and ShannonRNA, an RNA assembler, and compare their performance both with the fundamental limits and with state-of-the-art software in the field.