Spring 2015

IT Seminar

Apr. 15, 2015, 2:30 pm - 4:00 pm

Ilan Shomorony (UC Berkeley)


2nd floor interaction area

The Impact of Read Errors on DNA Sequence Assembly

While most current DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies promise much longer reads at the cost of high error rates. Given this technology tradeoff, it is natural to ask whether the negative impact of read errors is more than offset by the increase in read length.

While it is well known that read errors have a significant impact on the performance of existing assemblers, these observations usually pertain to specific assembly algorithms. A more fundamental question can be asked from an information-theoretic point of view: given a read length, an error rate and a coverage depth (number of reads per base), is there enough information in the read data to unambiguously reconstruct the genome? What is the fundamental tradeoff between read length and error rate?

Using an adversarial erasure error model, we make progress on this problem by establishing a critical read length, as a function of the target sequence and the error rate, above which perfect assembly is guaranteed. For several real genomes, we verify that this critical read length is close to the information-theoretic lower bound for assembly from error-free reads.