Monday, May 9th, 2016

10:30 am11:25 am
In a graph orientation problem, one is given an undirected graph and a set of source-target vertex pairs. The goal is to orient the edges of the graph so that a maximum number of pairs admit a directed source-to-target path. In this talk I will describe different versions of the graph orientation problem, discuss their complexity and approximability, and present efficient integer linear programming formulations for them. I will demonstrate the utility of these formulations in orienting protein-protein interaction networks to reveal their underlying structure.
11:25 am12:00 pm
 
Modeling RNA Folding by ILP - Yelena Frid
 
The basic RNA folding problem translated to maximizing the number of non-crossing complementary nucleotide  pairs, can be solved efficiently by dynamic programming, but the DP becomes more involved when we try to model more complex features of an RNA fold.Then, an ILP approach is more flexible and promising.  In this talk I show how to model RNA folding with nucleotide stacks and pseudoknots, using integer linear programming.
 
Trying to Simulate the History Bound by ILP - Julia Matsieva
 
The History Bound is a lower bound on the number of recombinations needed in an Ancestral Recombination Graph, to derive an input binary matrix. The History Bound was first defined only via an algorithm which had super-exponential running time. That time was improved to exponential via dynamic programming. However, as is common in DP, the worst-case and best-case times are the same. We have developed two ILP approaches to computing the History Bound, in hopes that, unlike DP, typical executions would be more efficient than worst-case. In this talk I will detail an effort that expresses the original (but optimized) algorithm as an ILP.  This is work in progress (translation - so far, it doesn't work very well).
 

 

1:15 pm1:40 pm
We describe two biologically motivated problems where ILP seems to save the day.  The first is searching for the presences of a protein complex that is known in one species in another species network, using homology but no topology. Here an interesting application of connectivity in ILP comes up.  The second arises in cancer genomics: given two copy number profiles of a chromosome, find their common ancestor such that the total number of segmental duplications and deletions between the ancestor and the two profiles is minimum. Here ILP formulation is rather straightforward, but a more compact instance can be formulated.
 

 

1:45 pm2:10 pm
DNA bulk sequencing of multiple samples from the same tumor provides data to analyze the process of clonal evolution in the population of cells that give rise to a tumor. We formalize the problem of reconstructing the clonal evolution of a tumor using single-nucleotide mutations as the variant allele frequency (VAF) factorization problem. We derive a combinatorial characterization of the solutions to this problem and show that the problem is NP-complete. We derive an integer linear programming solution to the VAF factorization problem in the case of error-free data. We extend this solution to a mixed integer linear programming formulation for real data with a probabilistic model for errors. The resulting AncesTree algorithm is better able to identify ancestral relationships between individual mutations than existing approaches, particularly in ultra-deep sequencing data when high read counts for mutations yield high confidence VAFs.
 
2:15 pm2:40 pm
For many important complex traits, Genome Wide Association Studies (GWAS) have only recovered a small proportion of the variance in disease prevalence known to be caused by genetics.  The most common explanation for this is the presence of multiple rare mutations that cannot be identified in GWAS due to a lack of statistical power.  Such rare mutations may be concentrated in relatively few genes, as is the case for many known Mendelian diseases, where the mutations are often  compound heterozygous (CH). Due to the multiple mutations, each of which contributes little to the prevalence of the disease, GWAS also lacks power to identify genes contributing to a CH-trait.  In this work, we address the problem of finding genes that are causal for CH-traits, by introducing a discrete optimization problem, called the "Phenotypic Distance Problem".  We show that it can be efficiently solved on realistic-size simulated CH-data by using integer linear programming (ILP). The empirical results strongly validate this approach.
 
3:00 pm3:25 pm
A common task in phylogenetics is to deal with a set of topologically different phylogenetic trees. One problem formulation is viewing these trees as displayed in a single underlying directed acyclic graph (or network). Here, we say each given tree is displayed in this network if we can obtain this tree by discarding some edges (while keeping all the leaves) of the network. The most common optimization objective for this problem is finding the most parsimonious network that displays all the given trees. This optimization problem is known to be NP complete even when there are only two trees. In this talk I will present my research on finding optimal network for displaying phylogenetic trees. I will start from a perhaps more well-known problem: computing the rooted subtree prune and regraft distance between two trees (which is in fact closely related to the problem of displaying these two trees in a network). Here I will show that there is a simple ILP formulation for this problem. I will then present ILP-based approaches for finding the optimal networks that display multiple trees.
 
 
3:30 pm3:55 pm
Cancer is a disease driven from somatic mutations appearing in an individual’s genome. One of the main challenges in large cancer studies is to identify the handful of driver mutations responsible for cancer progression and development among the hundreds or thousands of mutations present in a tumor genome. Recent approaches have shown that to identify driver mutations it is important to consider mutations in the context of interaction networks. A previously proposed formulation requires to find connected subgraphs of a large protein-protein interaction network that are mutated in a large number of patients, with a subgraph being mutated in a patient if at least one of its genes is mutated in the patient.  We study ILP formulations for the exact solution of the combinatorial problem of finding subnetworks containing mutations in a large fraction of cancer patients. We show that a branch and cut algorithm can solve such problem even faster than previously proposed greedy and approximation algorithms. We tested our algorithm on cancer data and show that our approach is viable and allows for the identification of subnetworks containing previously reported cancer genes.
 
 
4:00 pm4:25 pm
Computing accurate and robust organizational patterns of chromosome territories inside the cell nucleus is critical for understanding several fundamental genomic processes, such as co-regulation of gene activation, gene silencing, X chromosome inactivation, and abnormal chromosome rearrangement in cancer cells. The usage of advanced fluorescence labeling and image processing techniques has enabled researchers to investigate interactions of chromosome territories at large spatial resolution. The resulting high volume of generated data demands for high-throughput and automated image analysis methods. In this talk, we introduce a novel algorithmic tool for investigating association patterns of chromosome territories in a population of cells. Our method takes as input a set of graphs, one for each cell, containing information about spatial interaction of chromosome territories, and yields a single graph that contains essential information for the whole population and stands as its structural representative. We formulate this combinatorial problem as a semi-definite programming and present novel techniques to efficiently solve it. We validate our approach on both artificial and real biological data; the experimental results suggest that our approach yields a near-optimal solution, and can handle large-size datasets, which are significant improvements over existing techniques