Optimality Results on Compressed Representations for High Throughput Sequence Data

Abstract

Recent advances in high throughput sequencing (HTS) data compression have made it possible to reduce the size of both raw and mapped sequencing data by an order of magnitude.
Many HTS compression methods achieve this reduction in data redundancy either by assembling the reads into long contigs, or by aligning the reads to a reference genome; the reads are then encoded as simple pointers. These methods typically achieve good compression but have unreasonably high running times. The best compromise between the compression rate and running time is typically achieved by compressors that perform read mapping or assembly only implicitly; these methods first cluster reads that share long substrings, and independently compres reads in each cluster after some implicit assembly.

In this talk we will describe a general reference free compressed representation for HTS data based on complete or partial de novo assembly of reads into a forest of compact, bottom-up tries. We will also introduce a greedy method to quickly build a forest and show that it achieves the shortest possible encoding for any input data among all algorithms using this general
representation - including all known de novo assembly based compression methods. Interestingly, this encoding also approaches information theoretic optimality as the the length of the reference genome (under a number of constraints) increases. On a recently developed benchmark suite for genome sequence compression, our method improves on the compression rates achieved by the best available tools by 10% - 50%, without adding a significant burden on the running time.

Joint work with Tony Ginart, Joseph Hui, Kaiyuan Zhu, Ibrahim Numanagic, Tom Courtade and David Tse