Abstract

The human genome sequence contains the fundamental code that defines the identity and function of every cell type and tissue in the human body. Genes are functional sequence units that encode proteins, but they account for only about 2% of the 3-billion-base-pair human genome. What does the rest of the genome encode? How is gene activity controlled in each cell type? Where do the gene control units lie in the genome, and what is their sequence code? How do variants and mutations in the genome sequence affect cellular function and disease? Regulatory instructions for controlling gene activity are encoded, in the form of functional sequence syntax, in the DNA sequence of millions of cell-type-specific regulatory DNA elements. This regulatory code has remained largely elusive despite exciting developments in experimental techniques for profiling the molecular properties of regulatory DNA. To address this challenge, we have developed high-performance neural networks that learn de novo representations of regulatory DNA sequence to map genome-wide molecular profiles of protein-DNA interactions and biochemical activity at single-base resolution across thousands of cellular contexts, while accounting for experimental biases. We have also developed methods to interpret DNA sequences through the lens of these models and to extract local and global predictive syntactic patterns, revealing many causal insights into the regulatory code. Our models further serve as in silico oracles for predicting the effects of natural and disease-associated genetic variation, that is, how differences in DNA sequence between healthy and diseased individuals are likely to affect the molecular mechanisms associated with common and rare diseases. Our predictive models thus serve as an interpretable lens for genomic discovery.

Video Recording