Abstract

Although much is known about the extent and structure of genetic variation in humans and many other species, such information is typically not used to aid the assembly of subsequent genomes. Instead, reads are mapped against a single reference, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences with catalogues of SNPs and indels. The genomes of subsequent samples are reconstructed as paths through the graph using an efficient hidden Markov model structure in which short-read data are summarised through a de Bruijn graph. By applying the method to the extended MHC (HLA) region, combining eight assembled haplotypes, sequences of known classical HLA alleles, and 87,640 variants from the 1000 Genomes Project, we show, using SNP genotyping, short-read and long-read data, how the method improves the accuracy of individual genome assembly. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, making the case for continued development of reference-quality genome sequences.

Video Recording