Genomic sequencing assays such as ChIP-seq and DNase-seq can measure many types of genomic activity, but the high cost of sequencing limits the number of assays that are typically performed in a given experimental condition. I will discuss a principled method for selecting which genomics assays to perform, given a limited budget. The method relies on optimization of submodular functions, which are discrete set functions with properties analogous to those of certain continuous convex functions. I will also show how a similar submodular optimization approach can be applied to the problem of selecting a representative subset of protein sequences from a large database.
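As a rough illustration of budgeted selection with a submodular objective, the sketch below greedily maximizes a facility-location function over a toy similarity matrix. The choice of facility location, the function names, and the similarity values are all assumptions made for illustration; they are not taken from the work described here. For monotone submodular objectives under a cardinality budget, the greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
def facility_location(selected, sim):
    """f(S) = sum over all items i of the best similarity between i and any member of S.

    Assumes non-negative similarities; f is then monotone and submodular.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, budget):
    """Repeatedly add the item with the largest marginal gain until the budget is spent."""
    selected = []
    remaining = set(range(len(sim)))
    current = 0.0
    while remaining and len(selected) < budget:
        best, best_gain = None, float("-inf")
        for j in sorted(remaining):
            gain = facility_location(selected + [j], sim) - current
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


# Hypothetical pairwise similarities between five candidate assays (made-up values).
sim = [
    [1.0, 0.9, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.8, 0.1],
    [0.1, 0.3, 1.0, 0.6, 0.7],
    [0.2, 0.8, 0.6, 1.0, 0.3],
    [0.1, 0.1, 0.7, 0.3, 1.0],
]

selected = greedy_select(sim, budget=2)
print("selected assay indices:", selected)
print("objective value:", facility_location(selected, sim))
```

The same routine could in principle be pointed at a protein-sequence similarity matrix to pick a representative subset, since only the similarity input changes.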
I will also describe some of our work on unsupervised machine learning methods for interpreting large, heterogeneous collections of genomic data. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of such data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate chromatin conformation data, which are inherently pairwise. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D space to occupy the same type of domain.
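To make the idea of a pairwise prior concrete, the sketch below scores how strongly a set of per-position label distributions disagrees across weighted edges of a graph. It is only one possible way to express such a penalty: the use of KL divergence, the function names, and the toy numbers are assumptions for illustration, not a description of the GBR implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, with a small epsilon for stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def pairwise_penalty(label_probs, edges):
    """Sum of edge-weighted divergences between the label distributions of linked positions.

    label_probs: one distribution over labels per genomic position.
    edges: (i, j, weight) triples, e.g. derived from hypothetical 3D contact strengths.
    """
    return sum(w * kl_divergence(label_probs[i], label_probs[j]) for i, j, w in edges)


# Positions 0 and 2 are linked by a single hypothetical contact edge.
edges = [(0, 2, 1.0)]

# Two candidate annotations over two labels (made-up values).
agreeing = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]
disagreeing = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]

agree_penalty = pairwise_penalty(agreeing, edges)
disagree_penalty = pairwise_penalty(disagreeing, edges)
print("penalty when linked positions agree:", agree_penalty)
print("penalty when linked positions disagree:", disagree_penalty)
```

Adding a term like this to an annotation objective would favor solutions in which linked positions share a label, while leaving unlinked positions (position 1 here) unconstrained.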