Abstract

One of the most pressing challenges in genomics is to reconstruct a long, and contiguous genome sequence from short contigs. Recent work shows that long-range link information contained in Hi-C data and Chicago date can be exploited to recover the correct orders among contigs. In particular, there tends to be higher Hi-C link or Chicago link densities between pairs of two closely located contigs. For studying the problem of ordering contigs based on long-range link information, we propose a planted Hamiltonian path model, where edges on the hidden Hamiltonian path tend to be more heavily weighted than other edges. The goal is to recover the hidden Hamiltonian path from observation of edge weights. We identify the sharp information-theoretic limits of exact recovery and show that a simple greedy algorithm achieves the exact recovery limit within a factor of two.