Compared to transcriptional regulation, post-transcriptional regulation by RNA-binding proteins (RBPs) still receives less attention. Yet experimental approaches to read out transcript activity at distinct stages of expression have drastically improved, as have approaches to map the protein-RNA interactions that mediate regulation (e.g. CLIP). I will provide an overview of new approaches that we have developed in the past year, from protein-RNA interaction site calling to RBP motif finding to integration across multiple CLIP datasets. In most cases, our approaches are "early drafts" and provide interesting opportunities for principled computational solutions.
Tuesday, June 27th, 2017
We hypothesized that transcription factors (TFs) can recognize DNA shape without nucleotide sequence recognition. Several observations motivate an independent role for shape: many TF binding sites lack a sequence-motif, DNA shape adds specificity to sequence-motifs, and different sequences can encode similar shapes. We therefore asked whether the binding sites of any TFs are enriched for specific patterns of DNA shape features, such as minor groove width, propeller twist, helical twist, or roll. To discover these shape-motifs de novo, we developed ShapeMF, which performs Gibbs sampling directly on shape features rather than nucleotide sequences. Using ChIP-Seq data for 110 human ENCODE TFs, we find that most TFs have shape-motifs and strongly bind regulatory regions with shape-motifs in the absence of sequence-motifs. When shape- and sequence-recognition co-occur in a region, the two types of motifs can be overlapping, flanking, or separated by consistent spacing, with shape-motifs explaining low-information-content positions in and near sequence-motifs. Shape-motifs are prevalent in regions co-bound by multiple TFs. They also explain binding of the co-factors MYC-MAX and TBX5-NKX2-5, which cannot be accounted for with sequence-motifs. Finally, shape-motifs are strikingly different across TFs with nearly identical sequence-motifs, providing an explanation for their distinct binding locations. These results establish shape-motifs as drivers of TF-DNA recognition complementary to sequence-motifs.
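The idea of Gibbs sampling over continuous feature tracks can be illustrated with a toy sketch. This is a hedged illustration, not the ShapeMF implementation: the function name, the single 1-D feature track per region, and the squared-error scoring are my assumptions (ShapeMF operates on multiple shape features and a richer model).

```python
import numpy as np

def gibbs_shape_motif(profiles, w, iters=100, seed=0):
    """Toy Gibbs sampler for a shared length-w motif over continuous
    feature tracks (one 1-D array per region), in the spirit of
    classical sequence-motif Gibbs samplers. Illustrative only."""
    rng = np.random.default_rng(seed)
    starts = [rng.integers(0, len(p) - w + 1) for p in profiles]
    for _ in range(iters):
        for i, p in enumerate(profiles):
            # Motif model: mean profile over all *other* regions
            # at their current start positions (leave-one-out).
            model = np.mean([profiles[j][starts[j]:starts[j] + w]
                             for j in range(len(profiles)) if j != i], axis=0)
            # Score every window of region i against the model and
            # resample a start position proportional to exp(-error).
            errs = np.array([np.sum((p[s:s + w] - model) ** 2)
                             for s in range(len(p) - w + 1)])
            probs = np.exp(-(errs - errs.min()))
            probs /= probs.sum()
            starts[i] = rng.choice(len(probs), p=probs)
    return starts
```

With a strong planted motif, the sampler concentrates on the planted positions; the same resampling loop applies unchanged to any continuous shape feature.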
RNA-binding proteins play vital roles in many processes in the cell, yet little is known about their structural binding preferences. A comprehensive study measured the binding of more than 200 proteins to around 240K RNA probes each (Ray et al. 2013), but only focused on their sequence specificities since the RNA probes were designed to be unstructured. Recently, we made an algorithmic breakthrough in modeling and learning the structural preferences from these unstructured data (Orenstein, Wang and Berger 2016). As a result, a large-scale analysis of RNA-binding structural preferences became possible for the first time.
Here, we analyze the structural binding preferences of the largest compendium of RNA-binding proteins to date. First, we show that RNA structural variability exists in the unstructured data, and that it is correlated with RNA-binding preferences. Second, we assert that RBPs overall prefer to bind unpaired regions; many can bind in both loop and external regions, and more show a preference for loop regions than for external regions. Third, we gauge the improvement in protein-binding prediction that is achieved by using RNA structure, both in vitro and in vivo. We find that structure preferences as measured in vitro correlate with structure preferences identified in vivo.
These results are the first to analyze RNA structural preferences on such a large scale. We hope that our analysis will facilitate better understanding of protein-RNA binding and its roles in gene regulation and disease.
Protein networks have been shown to be instrumental in deciphering the molecular basis of disease. In this talk I will describe on-going work to harness physical interaction networks for the identification of cancer driver genes and the elucidation of neurodegenerative disease genes and modules.
We present a novel geometric method for the problem of global alignment of protein-protein interaction (PPI) networks. A PPI network can be viewed as a node-edge graph, and its alignment often requires solving some generalized version of the subgraph isomorphism problem, which is notoriously challenging and NP-hard. We propose a two-step algorithm for global PPI network alignment consisting of an Embedding Step and a Geometric Step. Our algorithm first applies three different graph embedding techniques that preserve the topological structure of the original PPI networks, mapping the problem from the graph domain to the geometric domain. We then define a new objective function, called the "Protein Mover's Distance" (PMD), to evaluate the alignment of two PPI networks. PMD is similar to the well-known "Earth Mover's Distance"; however, we also incorporate other information, e.g., the functional annotation of proteins. Our algorithm computes a rigid transformation for one of the embedded PPI networks so as to minimize its PMD to the other PPI network. By using the flow values of the resulting PMD as matching scores, we obtain the desired alignment. Compared with existing methods, our algorithm globally optimizes the problem and yields an alignment of better quality.
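The two-step pipeline can be sketched in miniature. This is a hedged toy version, not the authors' algorithm: it uses a single adjacency spectral embedding (rather than the three embeddings described), ignores functional annotations and the rigid-transformation step, and exploits the fact that with equal node counts and one unit of mass per node, the Earth Mover's Distance reduces to a minimum-cost perfect matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def spectral_embed(adj, dim=2):
    # Adjacency spectral embedding: eigenvectors of the largest-magnitude
    # eigenvalues, scaled by sqrt of the eigenvalue magnitudes.
    vals, vecs = np.linalg.eigh(adj)
    idx = np.argsort(-np.abs(vals))[:dim]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def align_networks(adj_a, adj_b, dim=2):
    # With equal node counts and uniform unit masses, EMD reduces to a
    # minimum-cost perfect matching over pairwise Euclidean distances
    # between embedded nodes.
    xa, xb = spectral_embed(adj_a, dim), spectral_embed(adj_b, dim)
    cost = np.linalg.norm(xa[:, None, :] - xb[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows, cols)), cost[rows, cols].sum()
```

The matching returned by the assignment solver plays the role of the PMD flow values described in the abstract.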
Wednesday, June 28th, 2017
In this talk we'll start with a high-level overview of the GPU architecture and programming model, and its applicability to several core problems in bioinformatics. We will then present a novel data-parallel algorithm for de Bruijn graph construction on the GPU in the context of short read micro-assembly, which is a critical step of many variant discovery pipelines. Here we specifically focus on the popular Genome Analysis Toolkit (GATK) HaplotypeCaller implementation. Since the micro-assembly procedure is typically executed independently on many separate windows of the genome, a coarse-grained parallel implementation can be easily achieved by executing the original (highly sequential) algorithm on each window in parallel. However, this approach suffers from poor load balancing and is not well suited for the GPU. In contrast, our algorithm is a fine-grained parallel implementation of the de Bruijn graph construction across any given number of genome windows.
It relies on data-parallel primitives, such as (segmented) sort, scan, filter, and reduction by key to efficiently populate the graph data structure in the compressed sparse row (CSR) format.
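As a serial illustration of how those primitives assemble the CSR structure, here is a numpy sketch. The function and the exact pipeline are my assumptions, not the GATK or GPU code: sorting edges by source stands in for the segmented sort, and bincount plus cumulative sum stand in for reduce-by-key followed by an exclusive scan.

```python
import numpy as np

def debruijn_csr(reads, k):
    """Sketch: de Bruijn graph in compressed sparse row (CSR) form,
    built with sort / unique / scan primitives (serial numpy stand-ins
    for their GPU data-parallel counterparts)."""
    # Each (k+1)-mer in a read defines one edge between two k-mer nodes.
    pairs = [(r[i:i + k], r[i + 1:i + 1 + k])
             for r in reads for i in range(len(r) - k)]
    kmers = sorted({km for e in pairs for km in e})
    ids = {km: j for j, km in enumerate(kmers)}
    src = np.array([ids[a] for a, b in pairs])
    dst = np.array([ids[b] for a, b in pairs])
    # Sort edges by source node (the sort primitive), drop duplicates.
    order = np.lexsort((dst, src))
    src, dst = src[order], dst[order]
    keep = np.ones(len(src), bool)
    keep[1:] = (src[1:] != src[:-1]) | (dst[1:] != dst[:-1])
    src, dst = src[keep], dst[keep]
    # Reduce-by-key (per-node out-degrees) + exclusive scan = CSR offsets.
    counts = np.bincount(src, minlength=len(kmers))
    offsets = np.concatenate(([0], np.cumsum(counts)))
    return kmers, offsets, dst
```

The neighbors of node `j` are then `dst[offsets[j]:offsets[j+1]]`; on the GPU each of these stages is a single data-parallel kernel across all windows at once.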
Nucleotide variation in gene regulatory elements is a major determinant of phenotypes including morphological diversity between species, human variation and human disease. Despite continual progress in the cataloging of these elements, little is known about the code and grammatical rules that govern their function. Deciphering the code and their grammatical rules will enable high-resolution mapping of regulatory elements, accurate interpretation of nucleotide variation within them and the design of sequences that can deliver molecules for therapeutic purposes. To this end, we are using massively parallel reporter assays (MPRAs) to simultaneously test the activity of thousands of gene regulatory elements in parallel. By designing MPRAs to learn regulatory grammar or to carry out saturation mutagenesis of every possible nucleotide change in disease causing gene regulatory elements, we are increasing our understanding of the phenotypic consequences of gene regulatory mutations. Regulatory elements can also serve as therapeutic targets. To highlight this role, we used CRISPR/Cas9 activation (CRISPRa) of regulatory elements to rescue a haploinsufficient disease (having a 50% dosage reduction due to having only one functional allele) in vivo. By targeting the Sim1 promoter or its 270kb distant hypothalamic enhancer, we were able to rescue the haploinsufficient obesity phenotype in Sim1 heterozygous mice, both using a transgenic and adeno-associated virus approach. Our results highlight how regulatory elements could be used as therapeutic targets to treat numerous altered gene dosage diseases.
Recent advances in high throughput sequencing (HTS) data compression have made it possible to reduce the size of both raw and mapped sequencing data by an order of magnitude.
Many HTS compression methods achieve this reduction in data redundancy either by assembling the reads into long contigs, or by aligning the reads to a reference genome; the reads are then encoded as simple pointers. These methods typically achieve good compression but have unreasonably high running times. The best compromise between compression rate and running time is typically achieved by compressors that perform read mapping or assembly only implicitly; these methods first cluster reads that share long substrings, and independently compress the reads in each cluster after some implicit assembly.
In this talk we will describe a general reference-free compressed representation for HTS data based on complete or partial de novo assembly of reads into a forest of compact, bottom-up tries. We will also introduce a greedy method to quickly build such a forest, and show that it achieves the shortest possible encoding for any input data among all algorithms using this general representation, including all known de novo assembly-based compression methods. Interestingly, this encoding also approaches information-theoretic optimality as the length of the reference genome (under a number of constraints) increases. On a recently developed benchmark suite for genome sequence compression, our method improves on the compression rates achieved by the best available tools by 10%-50%, without adding a significant burden on the running time.
Joint work with Tony Ginart, Joseph Hui, Kaiyuan Zhu, Ibrahim Numanagic, Tom Courtade and David Tse
Most methods for construction of gene regulatory networks from expression data rely on the precision matrix, the inverse of the covariance matrix, as an indicator of which variables are directly associated. The precision matrix assumes Gaussian linear data, and its entries are zero for pairs of variables that are conditionally independent given all other variables. In this talk, we propose the Distance Precision Matrix, which builds on distance covariance, a measure of possibly non-linear association due to Székely and Rizzo. Computing networks using the Distance Precision Matrix makes it possible to discard indirect associations even in the presence of non-linear interactions. The method can also be adapted to small-sample situations.
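A minimal sketch of the two ingredients, assuming the sample distance covariance of Székely and Rizzo and plain matrix inversion; the published method adds refinements (e.g. for small samples) that are omitted here, and the function names are mine.

```python
import numpy as np

def _centered_dist(x):
    # Pairwise distance matrix of a 1-D sample, double-centered
    # (rows, columns, and grand mean removed), per Szekely & Rizzo.
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(0) - d.mean(1)[:, None] + d.mean()

def dcov(x, y):
    # Sample distance covariance: captures non-linear association;
    # zero (in the population) iff x and y are independent.
    return np.sqrt(np.mean(_centered_dist(x) * _centered_dist(y)))

def distance_precision_matrix(data):
    # data: samples x variables. Build the matrix of pairwise distance
    # covariances, then invert it; near-zero off-diagonal entries then
    # suggest pairs with no direct (conditional) association.
    p = data.shape[1]
    c = np.array([[dcov(data[:, i], data[:, j]) for j in range(p)]
                  for i in range(p)])
    return np.linalg.inv(c)
```

Thresholding the off-diagonal entries of the result gives a candidate network, in analogy to Gaussian graphical models built from the ordinary precision matrix.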
No abstract available.
Thursday, June 29th, 2017
In almost every field in genomics, large-scale biomedical datasets are used to report associations. Extracting associations that recur across multiple studies while controlling the false discovery rate is a fundamental challenge. Here, we propose a new method to allow joint analysis of multiple studies. Given a set of p-values obtained from each study, the goal is to identify associations that recur in at least $k>1$ studies while controlling the false discovery rate. We propose several new algorithms that differ in how the study dependencies are modeled, and compare them and extant methods under various simulated scenarios. The top algorithm, SCREEN (Scalable Cluster-based REplicability ENhancement), is our new algorithm that works in three stages: (1) clustering an estimated correlation network of the studies, (2) learning replicability (e.g., of genes) within clusters, and (3) merging the results across the clusters. When we applied SCREEN to two real datasets it greatly outperformed the results obtained via standard meta-analysis. First, on a collection of 29 case-control gene expression cancer studies, we detected a large set of consistently up-regulated genes related to proliferation and cell cycle regulation. These genes are both consistently up-regulated across many cancer studies and well connected in known gene networks. Second, on a recent pan-cancer study that examined the expression profiles of patients with and without mutations in the HLA complex, we detected a large active module of up-regulated genes that are both related to immune responses and well connected in known gene networks. This module covers three times as many genes as the original study at a similar false discovery rate, demonstrating the high power of SCREEN.
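The core replicability question per gene — "is there an effect in at least k of n studies?" — has a classical building block, the partial conjunction p-value of Benjamini and Heller, which SCREEN extends with study clustering and dependence modeling. A hedged sketch of that building block only (Fisher-combined variant; the function name is mine):

```python
import numpy as np
from scipy.stats import chi2

def partial_conjunction_pvalue(pvals, k):
    """p-value for 'the association holds in at least k of the n studies'
    (partial conjunction, Fisher-combined variant): combine the n-k+1
    largest per-study p-values. Sketch of the classical building block;
    SCREEN itself additionally models inter-study dependence."""
    p = np.sort(np.asarray(pvals, dtype=float))[k - 1:]  # drop k-1 smallest
    stat = -2.0 * np.sum(np.log(p))
    return chi2.sf(stat, df=2 * len(p))
```

Applying a standard FDR procedure (e.g. Benjamini-Hochberg) across the per-gene partial conjunction p-values then controls the false discovery rate over replicability claims.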
This talk deals with finding self-similar trends in a long sequence presented as a stream. We focus on finding exact trends, as well as those that might have a small number of errors in the form of bit flips, while using a single pass and space polylogarithmic in the input size.
Single-cell RNA sequencing (scRNA-seq) is a powerful technique that enables researchers to measure gene expression at the resolution of single cells. Because of the low amount of RNA present in a single cell, many genes fail to be detected even though they are expressed; these genes are usually referred to as dropouts. Here, we present a general and flexible zero-inflated negative binomial model (ZINB-WaVE), which leads to low-dimensional representations of the data that account for zero inflation (dropouts), over-dispersion, and the count nature of the data. We demonstrate, with simulations and real data, that the model and its associated estimation procedure are able to give a more stable and accurate low-dimensional representation of the data than principal component analysis (PCA) and zero-inflated factor analysis (ZIFA), without the need for a preliminary normalization step.
This is joint work with Davide Risso, Fanny Perraudeau, Svetlana Gribkova and Sandrine Dudoit
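The observation model at the heart of the approach can be sketched as follows. This shows only the zero-inflated negative binomial likelihood for a single entry (names and parameterization are my assumptions); ZINB-WaVE additionally ties the mean and the zero-inflation probability to observed covariates and inferred low-dimensional latent factors.

```python
import numpy as np
from scipy.stats import nbinom

def zinb_logpmf(y, mu, theta, pi):
    """Log-probability of counts y under a zero-inflated negative
    binomial: with probability pi the count is a structural zero
    (dropout); otherwise it is NB with mean mu and dispersion 1/theta."""
    # scipy's NB parameterization: size theta, success prob theta/(theta+mu),
    # which gives mean mu and variance mu + mu**2/theta.
    nb = nbinom.logpmf(y, theta, theta / (theta + mu))
    with np.errstate(divide="ignore"):
        # At y == 0, the zero can come from either mixture component.
        at_zero = np.logaddexp(np.log(pi), np.log1p(-pi) + nb)
    elsewhere = np.log1p(-pi) + nb
    return np.where(np.asarray(y) == 0, at_zero, elsewhere)
```

The extra mass at zero relative to a plain negative binomial is exactly what lets the model separate dropouts from genuinely low expression.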
I have been exploring the hypothesis that the best integer programming solvers can largely be used as push-button black boxes for many significant problems in computational biology. If so, ILP would become a more accessible tool for biologists without a deeper background in computation and programming. The experimental results are generally positive, but with some exceptions. I will report on these results and outline future experiments.
In this talk, we discuss some of the advantages of read-cloud-based sequencing technologies (such as the 10X Genomics system) over standard short-read sequencing technologies, with applications to genome re-sequencing and assembly. We present theoretical bounds on the ability of these technologies to call variants in repeat regions of the genome, and novel ways of aligning read clouds to the reference genome and rescuing variants. We also discuss the potential of these technologies for metagenome binning and assembly.
Given an alphabet and integers k and L, a universal hitting set is a set of k-mers so that every possible L-long string over the alphabet must contain as a substring a k-mer from the set. We provide complexity analysis and efficient algorithms to construct small universal hitting sets. We also explore the relationship of such sets to minimizers, which have been adopted in many deep-sequencing algorithms, and show the advantage of universal hitting sets.
Joint work with Yaron Orenstein, David Pellow, Guillaume Marcais and Carl Kingsford
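The defining property of a universal hitting set is easy to state as a brute-force check, which also makes a convenient correctness test for any construction. This checker is my own illustration (exponential in L, so toy parameters only), not the efficient algorithms the abstract refers to:

```python
from itertools import product

def is_universal_hitting_set(kmer_set, k, L, alphabet="ACGT"):
    """Brute-force check of the defining property: every L-long string
    over the alphabet must contain some k-mer from the set as a
    substring. Only feasible for small alphabets and small L."""
    for chars in product(alphabet, repeat=L):
        s = "".join(chars)
        if not any(s[i:i + k] in kmer_set for i in range(L - k + 1)):
            return False  # found an L-long string the set misses
    return True
```

For example, over the alphabet {A, B} with k = 1 and L = 2, the set {A} is not universal (the string "BB" avoids it) while {A, B} trivially is; the interesting question the talk addresses is how small such sets can be made for realistic k and L.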
Tremendous energy and resources have been expended to sequence tumors in an effort to identify drivers of metastasis or the lethal phenotype. We applied genome and transcriptome sequencing to directly address this problem in patient-derived prostate tumor xenografts, well-characterized model systems. Extensive analysis of the tumor contents failed to reveal any mutations or gene expression linked to the metastatic phenotype. However, analysis of the stromal compartment revealed a gene expression signature predictive of metastasis in the xenograft models and in human cohorts. This is particularly interesting because in these patient-derived xenografts the human stroma is replaced by mouse stromal cells. The implication is that the human tumors actively educate the mouse stroma to provide a permissive environment for metastasis to occur. A further implication is that tumor-focused genome studies may have missed half the story, and that therapies targeting only the tumor may be missing the mark. These and other problems with precision oncology will be discussed.
Friday, June 30th, 2017
DNA methylation is one of the most effective indicators of epigenetic change and can therefore be used to predict disease onset. Early diagnosis of cancer is clinically essential for successful treatment. It is, however, extremely challenging, because early symptoms are absent or easily overlooked, so the most critical window for treatment is frequently missed. It is fundamental to detect a tumor as early as possible in order to provide proper treatment at the right time. Here we propose a novel method to select the most predictive methylation biomarkers by analyzing DNA methylation patterns in three specimen types: tumor and normal cells from patients, and whole blood from healthy people as background noise for liquid biopsy.
For this purpose, we developed MethylMarker to process bisulfite sequencing reads, identify differentially methylated regions across samples at single-CpG resolution, extend single CpGs to multiple CpGs to meet assay development criteria, screen out assay candidates likely to raise false-positive errors by factoring in the methylation pattern from whole blood, and run supervised and unsupervised machine learning algorithms to select the most predictive candidate CpG loci. The package also provides enhanced visualization, so one can confirm the recommended candidates through a graphical approach. As such, MethylMarker provides predictive candidate CpG regions that are diverse, sensitive, and robust enough to support the development of diagnostic and clinical assays. We applied MethylMarker to tumor/normal tissues from early-stage colorectal cancer patients and whole blood bisulfite sequencing data from healthy people. We identified tens of candidate DNA methylation loci, which predict tumor versus normal with ~95% accuracy.
Cancer is an evolutionary process driven by somatic mutations that accumulate in a population of cells that form a primary tumor. In later stages of cancer progression, cells migrate from a primary tumor and seed metastases at distant anatomical sites. I will describe algorithms to reconstruct this evolutionary process from DNA sequencing data of tumors. These algorithms address challenges that distinguish the tumor phylogeny problem from classical phylogenetic tree reconstruction, including challenges due to mixed samples and complex migration patterns.
Cancer progression is an evolutionary process characterized by the accumulation of mutations and responsible for tumor growth, clinical progression, and drug resistance development. We discuss how to reconstruct the evolutionary history of a tumor from single-cell sequencing data. The tumor phylogeny problem is challenging because of sequencing errors and the high rate of allelic drop-out in single cell whole-exome sequencing experiments. We present a probabilistic model and a Markov Chain Monte Carlo approach to learn tumor phylogenies from such data. We use the model to develop a statistical test of the infinite sites assumption, which is frequently made in cancer evolution. We find that the infinite sites assumption is often violated by back mutations and sometimes also by parallel mutations.
The analysis of the mutational landscape of cancer, including mutual exclusivity and co-occurrence of mutations, has been instrumental in studying the disease. We hypothesized that exploring the interplay between co-occurrence, mutual exclusivity, and functional interactions between genes would further improve our understanding of the disease and help to uncover new relations between cancer-driving genes and pathways. To this end, we designed a general framework, BeWith, for identifying modules with different combinations of mutation and interaction patterns. We focused on three different settings of the BeWith schema: (i) BeME-WithFun, in which the relations between modules are enriched with mutual exclusivity while genes within each module are functionally related; (ii) BeME-WithCo, which combines mutual exclusivity between modules with co-occurrence within modules; and (iii) BeCo-WithMEFun, which ensures co-occurrence between modules while the within-module relations combine mutual exclusivity and functional interactions. We formulated the BeWith framework using Integer Linear Programming (ILP), enabling us to find optimally scoring sets of modules. Our results demonstrate the utility of BeWith in providing novel information about mutational patterns, driver genes, and pathways. In particular, BeME-WithFun helped identify functionally coherent modules that might be relevant for cancer progression. In addition to finding previously well-known drivers, the identified modules pointed to the importance of the MTOR and MED23 genes as well as the interaction between NCOR and NCOA3 in breast cancer. Additionally, an application of the BeME-WithCo setting revealed that gene groups differ with respect to their vulnerability to different mutagenic processes, and helped us to uncover pairs of genes with potentially synergistic effects, including a potential synergy between mutations in TP53 and the metastasis-related gene DCC.
Overall, BeWith not only helped us uncover relations between potential driver genes and pathways, but also provided additional insights on patterns of the mutational landscape, going beyond cancer driving mutations.
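To make the two within-module mutation patterns concrete, here is a toy scoring sketch over a binary patients x genes mutation matrix. The scores are simplified illustrations of mine (coverage-minus-overlap for mutual exclusivity, multi-hit patient count for co-occurrence), not the ILP objectives BeWith actually optimizes:

```python
import numpy as np

def mutual_exclusivity_score(mut, genes):
    """Coverage-minus-overlap score for a candidate module, given a
    binary patients x genes mutation matrix. High when the genes'
    mutations cover many patients with little overlap."""
    m = mut[:, genes]
    coverage = int(np.count_nonzero(m.sum(1) > 0))  # patients hit >= once
    overlap = int(m.sum()) - coverage               # excess (co-)mutations
    return coverage - overlap

def cooccurrence_score(mut, genes):
    # High when patients tend to carry mutations in several module genes.
    m = mut[:, genes]
    return int(np.count_nonzero(m.sum(1) >= 2))
```

Searching for gene sets that maximize one score within modules and the other between modules is the kind of combinatorial objective that the ILP formulation solves to optimality.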